CitySim Diffusion Engine Could Redefine How We Test Traffic

Cities aren’t just collections of streets and buildings; they’re living systems where thousands of tiny decisions ripple into delays, risks, and shared moments of courage or fatigue. For autonomous vehicles, understanding that living city is the key to safety and trust. A new study from Waymo and UT Austin makes a bold bet: build CitySim, a city-scale traffic simulator powered by a single, end-to-end generative world model called SceneDiffuser++. The goal isn’t a few seconds of plausible driving, but long, trip-level rollouts that feel like watching a city breathe. From the initial scatter of cars and pedestrians to the moment a stray cyclist vanishes behind a bus and reappears, CitySim aims to capture the twists and turns of real-world urban mobility at scale.

Behind the project lurks a practical dream: test autonomous-vehicle software not just on curated clips but across immersive, sprawling journeys. The authors—led by Shuhan Tan of Waymo, with collaborators from UT Austin—argue that traditional simulators either replay short, event-based clips or rely on handcrafted rules, approaches that miss the messy, dynamic, long-horizon nature of city driving. CitySim, they propose, should be a single, cohesive model that can generate the scene, animate the agents, reason about occlusions, spawn and remove new players as needed, and drive the environment (traffic lights and other elements) all in a unified loop. It’s a tall order, but if they pull it off, the payoff could be huge: a way to stress-test AV software across countless kilometers of synthetic, yet believable, city life.

It’s easy to nod at the vision and move on, but the paper grounds this dream in concrete challenges that have dogged earlier work. When a simulated route diverges from real-world data, you can’t simply replay the original logs—new agents must appear, others must disappear, and traffic signals must behave coherently with the evolving scene. That requires not just modeling vehicles but modeling the city as a living, responsive system. SceneDiffuser++ treats the scene as a set of tensors—one for agents, one for traffic lights, and possibly more—each evolving over time. The trick is to learn a joint, diffusion-based model that can denoise a noisy version of those tensors into a plausible scene configuration, all while allowing dynamic agent generation and occlusion reasoning. In other words, CitySim wants to be a real-time, long-horizon storyteller for a city’s traffic. The work behind it is a collaboration that signals a shift in how we think about simulation—from resetting a scene and replaying it to letting a single model continuously narrate a city in motion.

To place this in context: the team used the Waymo Open Motion Dataset and an expanded map dataset (WOMD-XLMap) to train and validate their approach. They push beyond the confines of logged data by letting the model decide which agents exist in a scene at a given moment, where they appear, and how they disappear, all while maintaining realism in speed, spacing, and interactions. The result, they argue, is a city-scale simulator that can run long trips with a believable density of agents, varied traffic-light states, and dynamic occlusions—things that are essential for evaluating how AV software would actually perform in a living city. The paper’s central pitch is not that this is a perfect replica of reality, but that it’s a controllable, learnable world model that can generate realistic long rollouts with a single training objective. This is the kind of simplicity that could unlock more trustworthy test scenarios and, ultimately, safer autonomous driving.

A Unified Dream: CitySim and Trip-Level Simulation

CitySim isn’t just a tool; it’s a philosophy about scale. The authors formalize the aim as trip-level simulation: you give the system a city region, a starting point, and a destination, and it populates the space with dynamic agents and an evolving environment, letting the ego vehicle travel through the scene as if it were real life. The advantage is clear: you can capture how long-horizon events unfold—how traffic-light cycles shape the flow of vehicles through an intersection, how pedestrians thread through crossings, how vehicles merge around a blocked lane—without being tethered to a fixed dataset. Trip-level realism matters because many safety and performance issues only reveal themselves over longer horizons: how does an AV handle a rare but dangerous interaction with an ambulance, or a sudden pedestrian dart that cascades into a traffic jam? CitySim is designed to probe those sequences in a controlled, repeatable way, with the capacity to stress-test failure modes in a sandbox that looks and feels like a real city.

The study highlights three capabilities that are routinely taken for granted in short clips but become essential in longer rollouts: dynamic agent generation (agents can spawn and disappear as needed), occlusion reasoning (the model understands when something is hidden and when it reappears), and environment dynamics (traffic lights and other environmental factors behave coherently with the evolving scene). Put simply, you need a model that can handle not just “who is on the road now” but “who might pop into existence next, where they will be, and how the world around them will respond.” CitySim frames these as part of a single, end-to-end learned system, a diffusion-based world model that can be rolled out autoregressively. The result is a simulation that doesn’t just feel plausible in snapshots but sustains plausibility across minutes of city life—precisely the kind of long view modern AV planners crave.
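
To make that loop concrete, here is a minimal Python sketch of a chunked, autoregressive rollout in the spirit described above. The stand-in sampler, the tensor layout (a validity channel plus a handful of state features), and the replanning cadence are illustrative assumptions, not the paper’s actual interfaces.

```python
# Minimal, self-contained sketch of a chunked autoregressive rollout.
# The "world model" here is a trivial stand-in (it just perturbs the last
# frame); names, shapes, and step sizes are assumptions, not the paper's API.
import numpy as np

AGENTS, FEATURES, HORIZON = 32, 6, 30   # feature 0 = validity, 1-5 = state (assumed layout)

def sample_future(context_frame: np.ndarray, horizon: int, rng) -> np.ndarray:
    """Stand-in for the diffusion sampler: produce `horizon` frames that
    drift from the context frame and occasionally flip validity bits so
    agents can appear or disappear during the rollout."""
    frames = np.repeat(context_frame[None], horizon, axis=0)          # [horizon, AGENTS, FEATURES]
    frames[:, :, 1:] += rng.normal(scale=0.1, size=frames[:, :, 1:].shape).cumsum(axis=0)
    flips = rng.random(frames[:, :, 0].shape) < 0.01                  # rare spawn/removal events
    frames[:, :, 0] = np.where(flips, 1.0 - frames[:, :, 0], frames[:, :, 0])
    return frames

def rollout(init_frame: np.ndarray, total_steps=120, replan_every=10, seed=0):
    """Commit only the first `replan_every` frames of each sampled window,
    then re-condition on the last committed frame and sample again."""
    rng = np.random.default_rng(seed)
    context, committed = init_frame, []
    for _ in range(0, total_steps, replan_every):
        window = sample_future(context, HORIZON, rng)
        committed.append(window[:replan_every])
        context = window[replan_every - 1]
    return np.concatenate(committed, axis=0)

init = np.zeros((AGENTS, FEATURES)); init[:8, 0] = 1.0                # start with 8 valid agents
trajectory = rollout(init)
print(trajectory.shape, "valid agents at end:", int(trajectory[-1, :, 0].round().sum()))
```

The point of the sketch is the control flow, not the dynamics: sample a future window, keep only its near-term portion, then re-condition and repeat, so agents can enter and leave over arbitrarily long trips.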

The paper also makes a candid case for why this is different from past work. Earlier diffusion-based simulators tended to fix a scene’s agent set or to assume ground-truth validity for every agent, which limited the duration of plausible rollouts. CitySim treats agent validity (whether an agent is present and observable) as an explicit part of the model, so agents can appear and vanish in the scene without breaking the narrative. It also brings in traffic lights as first-class elements with their own states and transitions, rather than treating signals as static or known from a logged dataset. It’s a small shift in modeling granularity, but the authors show that it makes a big difference for trip-level realism. The paper’s claim is bold: a single, jointly trained model can initialize, evolve, and render a city-scale scene over long horizons, while staying faithful to the kinds of agent interactions and environmental cues we see in the real world.

SceneDiffuser++: The One Model to Rule the Rollout

The core technical move is to upgrade and unify diffusion-based world modeling into a multi-tensor, end-to-end framework. Think of the city as a handful of layers: one tensor for all agents (cars, pedestrians, bicycles, and the ego vehicle), another tensor for traffic lights, and potentially others for road geometry or weather cues. Each tensor has a variable number of elements and a fixed set of features. The model learns to denoise a noisy, future-looking version of these tensors back into a coherent scene. Crucially, it does this across time steps, so you can predict a sequence of frames that preserves plausible trajectories and interactions while accommodating new agents and occlusions.
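
As a rough illustration of that framing, the sketch below represents a scene as a dictionary of per-layer tensors and applies a forward noising step of the kind a diffusion objective would train against. The shapes, the feature layout, and the cosine-style noise schedule are assumptions, not the published configuration.

```python
# Sketch of the multi-tensor scene representation and the forward noising
# step of a diffusion objective. Shapes, feature layouts, and the noise
# schedule are assumptions for illustration, not the paper's exact setup.
import numpy as np

rng = np.random.default_rng(0)

# One tensor per scene "layer": a fixed maximum element count, time steps,
# and per-element features (feature 0 reserved for validity in this sketch).
scene = {
    "agents":         rng.normal(size=(128, 91, 8)),   # [max_agents, T, features]
    "traffic_lights": rng.normal(size=(32, 91, 4)),    # [max_lights, T, features]
}

def add_noise(x: np.ndarray, t: float) -> tuple:
    """Variance-preserving forward process at noise level t in [0, 1]:
    interpolate toward pure Gaussian noise and return both the corrupted
    tensor and the noise (the usual denoising regression target)."""
    eps = rng.normal(size=x.shape)
    alpha = np.cos(0.5 * np.pi * t)                     # assumed cosine-style schedule
    return alpha * x + np.sqrt(1.0 - alpha**2) * eps, eps

t = rng.uniform()                                       # one shared noise level per example
noisy_scene = {name: add_noise(x, t) for name, x in scene.items()}
# A denoiser would take all noisy tensors (plus the map) and predict the
# noise, or the clean scene, jointly across layers and time steps.
```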

To make this work, the authors introduce a sparse-tensor learning approach. In a city-scale scene, only a subset of agents is present at any moment, and many potential elements may be occluded or invalid. The model therefore predicts a “validity” channel alongside the usual agent features (position, size, velocity, type) and traffic-light states. The trick is elegant: during training, they mask invalid entries to zero and supervise only the valid parts, while during inference they use a soft-clipping mechanism that blends the predicted, denoised values with a zero baseline wherever an element is deemed invalid. This keeps the model from being overwhelmed by spurious or non-existent agents and helps it learn to insert and remove agents smoothly, without collapsing into unrealistic clutter or, conversely, freezing the scene with too many fixed agents. It’s the kind of practical hack that makes a complicated idea work in the messy, long-run reality of a simulation.
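
A small sketch of those two tricks, under assumed shapes and thresholds, might look like the following; the masking rule and the soft-clip gate are illustrative stand-ins rather than the authors’ exact formulation.

```python
# Sketch of the two tricks described above: supervise only valid entries
# during training, and softly pull invalid entries toward zero at inference.
# Thresholds and the blending rule are illustrative assumptions.
import numpy as np

def masked_denoising_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean-squared error over agent features, counted only where the
    ground-truth validity channel (feature 0) marks the entry as valid."""
    valid = target[..., 0] > 0.5                              # [agents, T] boolean mask
    per_entry = ((pred - target) ** 2).mean(axis=-1)          # error per agent per step
    return float(per_entry[valid].mean()) if valid.any() else 0.0

def soft_clip_invalid(denoised: np.ndarray, sharpness: float = 10.0) -> np.ndarray:
    """At inference, blend each entry's features toward a zero baseline in
    proportion to how invalid the model believes it to be, instead of a hard
    on/off cut, so spawning and removal stay smooth over time."""
    validity = 1.0 / (1.0 + np.exp(-sharpness * (denoised[..., 0] - 0.5)))  # soft 0..1 gate
    out = denoised.copy()
    out[..., 1:] *= validity[..., None]                       # fade features of unlikely agents
    return out

rng = np.random.default_rng(1)
pred, target = rng.normal(size=(128, 91, 8)), rng.normal(size=(128, 91, 8))
target[..., 0] = (rng.random((128, 91)) < 0.3).astype(float)  # ~30% of entries valid
print("masked loss:", masked_denoising_loss(pred, target))
```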

Another notable architectural move is the multi-tensor transformer backbone. By homogenizing the different scene tensors (agents and lights) into a shared latent space, the model can reason about the joint evolution of these disparate scene elements. The result looks like a chorus of actors on a stage: the cars, the pedestrians, the traffic signals, and even the road geometry all interact in a coherent, learned narrative. The system’s generative power isn’t just about “what could happen” to a single car, but about the emergent choreography of a city over dozens or hundreds of simulated seconds. The authors describe this as a step toward a truly generative world model that can take a map and turn it into a living city for a vehicle stack to test against, without needing specialized rules for every tiny interaction.
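
Here is one way such a backbone could be wired up in PyTorch, purely as a sketch: per-type linear projections and learned type embeddings map agents and traffic lights into a single token sequence for joint attention. The dimensions, projections, and embeddings are assumptions, not the published architecture.

```python
# Sketch of "homogenizing" heterogeneous scene tensors into one shared token
# sequence for a transformer backbone. All design choices here are assumed.
import torch
import torch.nn as nn

class MultiTensorBackbone(nn.Module):
    def __init__(self, feature_dims: dict, d_model: int = 256):
        super().__init__()
        # One linear projection per scene tensor type (agents, lights, ...),
        # plus a learned type embedding so the transformer can tell them apart.
        self.proj = nn.ModuleDict({k: nn.Linear(d, d_model) for k, d in feature_dims.items()})
        self.type_emb = nn.ParameterDict({k: nn.Parameter(torch.zeros(d_model))
                                          for k in feature_dims})
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, tensors: dict) -> torch.Tensor:
        # Flatten each tensor's (elements x time) grid into tokens, project to
        # the shared latent width, and concatenate into one joint sequence.
        tokens = []
        for name, x in tensors.items():                        # x: [B, N, T, F_name]
            b, n, t, f = x.shape
            tok = self.proj[name](x.reshape(b, n * t, f)) + self.type_emb[name]
            tokens.append(tok)
        return self.encoder(torch.cat(tokens, dim=1))          # joint attention over all elements

model = MultiTensorBackbone({"agents": 8, "traffic_lights": 4})
out = model({"agents": torch.randn(2, 16, 11, 8), "traffic_lights": torch.randn(2, 4, 11, 4)})
print(out.shape)   # [2, 16*11 + 4*11, 256]
```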

On the evaluation side, the team borrows a practical lens: they measure realism with histograms of key metrics over sliding windows of the rollout, then compare the simulated distributions to those observed in the logged data using Jensen–Shannon divergence. The metrics cover how many agents are valid at any moment, how many enter and exit during a window, how far the entering or exiting agents are from the AV, off-road rates, collision rates, average speed, and traffic-light violations and transitions. The results are telling: SceneDiffuser++ consistently lowers the divergence across most of these measures compared with earlier diffusion-based models and traditional planners like IDM (Intelligent Driver Model), especially for dynamic events like insertion/removal of agents and realistic traffic-light behavior. In short, the model doesn’t just look plausible in isolated frames; it preserves a believable city-level tempo over longer rollouts.
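
The evaluation recipe itself is easy to sketch: histogram a per-step metric over sliding windows of a rollout, build the same histogram from logged data, and compare the two with Jensen–Shannon divergence. The bin edges, window size, and stand-in data below are assumptions; only the overall recipe follows the paper.

```python
# Sketch of the distributional comparison described above, with synthetic
# stand-in data. Bin choices, window sizes, and pooling are assumed.
import numpy as np
from scipy.spatial.distance import jensenshannon

def windowed_histogram(values: np.ndarray, window: int, bins: np.ndarray) -> np.ndarray:
    """Histogram a per-step metric (e.g., number of valid agents) pooled over
    all sliding windows of the rollout, normalized to a probability vector."""
    pooled = np.concatenate([values[i:i + window] for i in range(len(values) - window + 1)])
    hist, _ = np.histogram(pooled, bins=bins)
    return hist / max(hist.sum(), 1)

rng = np.random.default_rng(2)
bins = np.arange(0, 60, 5)                       # e.g., counts of valid agents, binned by 5
sim_counts = rng.poisson(28, size=600)           # stand-in for a simulated 60 s rollout at 10 Hz
log_counts = rng.poisson(30, size=600)           # stand-in for the logged reference data

p = windowed_histogram(sim_counts, window=50, bins=bins)
q = windowed_histogram(log_counts, window=50, bins=bins)
jsd = jensenshannon(p, q, base=2) ** 2           # scipy returns the JS *distance* (sqrt of JSD)
print(f"Jensen-Shannon divergence: {jsd:.4f}")
```

Lower divergence means the simulated histogram tracks the logged one more closely, which is exactly the sense in which the paper claims SceneDiffuser++ improves on earlier diffusion models and IDM-style planners.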

One striking visualization from the paper shows a 60-second rollout where the model inserts agents into parking lots, has them emerge onto the main road, and smoothly obey traffic lights, with a distribution of visible agents that shifts realistically over time. Another comparison demonstrates how standard SceneDiffuser tends to “lock” agents into the scene, causing unrealistic stagnation, while SceneDiffuser++ sustains a livelier, more varied urban tapestry. The researchers are transparent about limits—off-road rates, for instance, rise in some configurations when many new agents appear in the scene—but they also show how adjusting replanning frequencies and horizon lengths can mitigate some of these issues. The upshot is a more faithful, more controllable long-horizon simulation that better reflects the complexity of real cities.

Why This Matters Now: Safety, Planning, and Real-World Impact

If CitySim delivers on its promise, it could become a new backbone for autonomous-vehicle safety validation and software development. Traditional AV testing relies heavily on logged data, curated scenarios, and short, bounded simulations. CitySim offers a route to enormous, virtual test miles that preserve realism in how agents appear, disappear, and interact, as well as how traffic signals shape decisions. In practical terms, this could translate into more comprehensive stress-testing before a feature is pushed to a fleet, enabling safer deployment in more varied urban contexts. The authors emphasize trip-level evaluation, which means you can quantify end-to-end metrics like trip travel time, pickup and drop-off quality, safety rates, and system-level fault discovery—essential for understanding how a new perception stack, planning module, or control policy would perform in the wild over time.

There’s a second, equally important implication: better, more diverse synthetic data can complement real-world data, reducing the need for expensive, time-consuming driving miles and enabling tests that explore corner cases that are improbable in logs but plausible in the real world. By learning a single, coherent model that can initialize, evolve, and lay out a city-scale scene, SceneDiffuser++ lowers the barrier to exploring counterfactuals—“what if a stop line is occluded for longer?” or “what if a pedestrian steps off the curb earlier than expected?”—in a controlled, repeatable fashion. This isn’t about replacing human driving studies; it’s about expanding the laboratory where AV engineers can probe, validate, and iterate on their software with a broader array of scenarios than a handful of edge cases in a log file.

The study also makes a clear case for collaboration across institutions: the work is anchored in Waymo’s data and engineering ecosystem, with a strong collaborating voice from UT Austin. The authors, led by Shuhan Tan, present CitySim as a realistic milestone in a longer arc of world-model research—one that began with dreams of “learning to simulate the world” and has now matured into tools that can be trusted to generate coherent, long-form city behavior. This is the sort of research that bridges the gap between elegant theory and practical engineering, where a clever loss function and a few architectural adaptations translate into a tool that could change how we test, compare, and improve autonomous-driving software at scale.

Of course, there are caveats. The model’s strength—dynamic agent generation and long-horizon realism—also creates opportunities for mischief if not handled carefully: how do we ensure the synthetic city doesn’t become a playground for unsafe, unrealistic behavior? The authors acknowledge that balancing realism, controllability, and safety remains a delicate act, and they emphasize the need for careful evaluation across diverse settings, replanning cadences, and horizon lengths. They also note that the current setup does not condition on explicit goals or routes for the ego vehicle, leaving room for future work to integrate CitySim with goal-directed planners and more explicit routes. In short, CitySim is a powerful step forward, not a final destination, and its true impact will depend on how it’s used by researchers and engineers as they push AV safety and reliability forward in the chaotic, thrilling laboratory that is a modern city.

As for the people behind the project, the study is a collaboration between Waymo LLC and UT Austin, with Shuhan Tan (and colleagues such as John Lambert, Hong Jeon, Sakshum Kulshrestha, Yijing Bai, Jing Luo, Dragomir Anguelov, Mingxing Tan, and Chiyu Max Jiang) cited among the authors. Their collective effort marks a notable moment in the field: a single diffusion-based world model that can carry a city through a long, plausible journey, from the first spark of an initial scene to the final, traffic-aware destination. It’s a reminder that in AI-driven simulations, the boundary isn’t just about making things look right; it’s about teaching a model to understand a city’s tempo—the way a red light becomes green, the way a pedestrian appears just as a car begins to turn, the way a cluster of parked cars turns into a dynamic, interacting crowd the moment a curbside bus pulls away.

What’s Next for CitySim and the World of Traffic AI

Lurking in the margins of the paper are hints of what comes next: more explicit, goal-conditioned planning that works hand in hand with the world model, better integration of weather and lighting conditions, and broader tests on city-scale maps that push the model’s generalization beyond the WOMD-XLMap. If the community can integrate CitySim with planning stacks that are also learning-based, we could envision a feedback loop where planners and world models improve each other in a virtuous cycle, driving safer, more robust autonomous systems. The authors’ emphasis on a single, unified loss that drives the entire pipeline is a provocative design choice; it invites us to reimagine simulation not as a patchwork of components but as a single, coherent knowledge system that can learn, adapt, and roll out with fewer hand-tuned edges. In the end, CitySim’s promise is simple and compelling: a city-scale sandbox where the future of autonomous driving can be imagined, tested, and improved with something that feels almost like intuition—only built from data, not just imagination.

For readers who follow the arc of autonomous-vehicle research, CitySim and SceneDiffuser++ feel like a hinge moment. They show how advances in diffusion models, multi-tensor architectures, and creative loss design can scale up to a living, breathing urban environment. The project is a clear invitation: if we want safer, more reliable AV systems, we’ll need to test them in long, diverse, and realistically crowded simulations. CitySim isn’t just another simulator; it’s a blueprint for a new kind of city-scale lab, where the traffic light’s glow, the hum of engines, and the sudden swerve of a pedestrian are all part of a shared narrative that engineers can study, critique, and improve upon—one kilometer, then another, then another, until the stories of our cities finally align with the safety and trust we demand from technology that moves us forward.

CitySim could become a new standard in how we validate, compare, and refine autonomous-vehicle software, turning trip-level realism from a thorny aspiration into a practical, testable capability. And as the authors remind us, the journey from a single model to a city-scale, testable reality is as much about courage—trying something big and risky—as it is about careful science: thoughtful metrics, rigorous evaluation, and a willingness to iterate toward a safer, more trustworthy future for urban mobility.

The study’s origin—a collaboration between Waymo LLC and UT Austin, led by Shuhan Tan—anchors the work in a real-world engineering context, where the goal is not just theory for theory’s sake but a tool that could help fleets, policies, and safety standards evolve in step with our cities. If the promise holds, CitySim could become a shared instrument for researchers, planners, and engineers to invent, test, and compare the next generation of driving technology in a city that never stops teaching us how to move more safely, more efficiently, and with a little more grace.