The roar of a boat cutting through water, the creak of rigging, the distant birdsong—the kinds of sounds that make a screen moment feel real are often the quiet heroes of cinema’s mood. Foley artists do that work in real life, crafting and layering sounds to match what we see on screen. It’s a craft as old as film, and it’s famously hands-on: a studio full of props, materials, and instinct, stitched together with a sound designer’s attention to timing and texture. Now a team of researchers from Sony AI has proposed a new way to automate a big chunk of that art without sacrificing control or quality. They frame video-to-audio generation as a step-by-step process, where each audible event gets its own track before the final mix.
The work, led by Akio Hayakawa and colleagues at Sony AI—the research arm of Sony Group Corporation—takes Foley’s modular spirit and translates it into a machine-learning workflow. The paper’s authors include Masato Ishii, Takashi Shibuya, and Yuki Mitsufuji, among others, all working to make high-quality, semantically aligned audio for video more accessible, repeatable, and collaborative to produce. Their result isn’t a single slick soundtrack but a method for building one piece at a time, just as a Foley artist would, while staying anchored to the visuals and to the creator’s text prompts.
Step-by-step Foley enters the digital age
Imagine trying to recreate a scene’s soundscape by guessing what fits best all at once. That’s how many video-to-audio systems work today: they generate a complete audio track in one shot, conditioned on the video and perhaps a text prompt. If something in that track doesn’t quite land, you’re stuck regenerating the whole thing, which can be inefficient and frustrating in a collaborative setting. The Sony AI approach reframes the task as a sequence: first render the most salient sound event, then add another, and another, until the composite audio captures all the events the video implies. It’s Foley as a multi-step dialogue between machine and creator.
Concretely, the researchers introduce a framework in which each step targets a distinct sound concept. The model is given a text prompt describing the event it should add, and it conditions its generation on the audio tracks produced in earlier steps. The result is a set of semantically distinct audio streams that can be mixed into the final soundtrack. A crucial insight they borrow from prior compositional-generation work is the idea of negation: to create a new track, the model is guided away from concepts it has already rendered. That is how the second sound stays cleanly separate from the first.
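To make the workflow concrete, here is a minimal sketch of that stepwise loop as the article describes it: one text prompt per sound event, with each new track generated while the earlier tracks are passed in as material to steer away from. The function `generate_track`, the placeholder noise it returns, and the prompt list are illustrative assumptions, not the paper’s API.

```python
import numpy as np

def generate_track(video, prompt, negative_tracks, length=48000, seed=0):
    """Placeholder for a guided video-to-audio sampler.

    In the real system this would run the flow-matching model conditioned on
    the video and the text prompt while steering away from `negative_tracks`
    (the audio produced in earlier steps). Here it returns noise so the loop
    below runs end to end.
    """
    rng = np.random.default_rng(seed + len(negative_tracks))
    return 0.1 * rng.standard_normal(length)

video = "clip.mp4"  # stand-in for the video (or its features)
prompts = ["moose snorting", "stream splashing", "wind through pines"]

tracks = []
for prompt in prompts:
    # Each step adds one semantically distinct sound event, conditioned on
    # the video, the current prompt, and everything generated so far.
    new_track = generate_track(video, prompt, negative_tracks=tracks)
    tracks.append(new_track)

# The per-event tracks stay separate until the final mix.
print(f"generated {len(tracks)} tracks of {tracks[0].shape[0]} samples each")
```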
The project isn’t about reinventing years of Foley practice; it’s about capturing that same modular logic in a system that can run on standard datasets and pre-trained video-to-audio models. In other words, you don’t need a warehouse of multi-track Foley samples to train a specialized model. You train a capable base model once, then teach it to add or subtract sound concepts step by step. The approach also opens the door to close collaboration between human sound designers and AI—humans can steer the prompts, and the system can propose concrete, track-level contributions on demand.
How Negative Audio Guidance works
At the heart of the method is a concept with a mouthful of a name: Negative Audio Guidance, or NAG. It borrows a powerful idea from the broader world of generative models—concept negation. The idea is simple in spirit: when you generate the next sound track, you push the model away from what has already been generated. If the first track is a moose’s snort and a splashing stream, the second track should avoid re-creating those noises and instead fill in a different sound event, like wind rustling through leaves or distant birds. The result is a cleaner separation of sounds, making it easier for a mix engineer to balance levels and space in the final audio.
Technically, the team builds on a flow-matching framework, a close relative of diffusion models. In this world, the model learns to predict how to morph noise into a realistic audio sample, guided by the video and a text prompt. When adding a new track, the model’s guidance combines three terms: the unconditional flow, a term that ties the generation to the video and the current prompt, and a negation term that nudges the generation away from the already-produced sounds. In practice, this means the system treats the new track as a fresh concept space, conditioned both on what’s in the frame and on the sounds that have already appeared.
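Written out as code, that three-term combination might look like the sketch below: start from the unconditional velocity, add a weighted pull toward the video-and-prompt condition, and subtract a weighted pull toward the audio generated in earlier steps. The function name `nag_velocity` and the guidance weights are assumptions for illustration; the paper defines its own exact formulation.

```python
import numpy as np

def nag_velocity(v_uncond, v_cond, v_prev_audio, w_cond=4.0, w_neg=1.5):
    """Combine flow-matching velocity estimates in the spirit of the
    three-term guidance described above.

    v_uncond     : estimate with no conditioning
    v_cond       : estimate conditioned on the video and the current prompt
    v_prev_audio : estimate conditioned on the previously generated tracks
    w_cond, w_neg: illustrative guidance weights, not the paper's values
    """
    # Pull toward the video + prompt, push away from sounds already produced.
    return (v_uncond
            + w_cond * (v_cond - v_uncond)
            - w_neg * (v_prev_audio - v_uncond))

# Toy call with random "velocities" just to show the arithmetic.
rng = np.random.default_rng(0)
shape = (1, 16000)
guided = nag_velocity(rng.standard_normal(shape),
                      rng.standard_normal(shape),
                      rng.standard_normal(shape))
print(guided.shape)  # (1, 16000)
```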
To implement this without requiring specially curated multi-track datasets—which would be hard to come by—the researchers designed a training approach that leverages existing text-video-audio datasets. They couple a pre-trained video-to-audio model with a trainable flow-estimator module that can incorporate audio conditioning. In short, they retrofit a strong existing model with a small, flexible add-on that handles the new, step-by-step logic. That add-on is trained on ordinary datasets whose audio isn’t separated into fixed per-event tracks, which keeps the approach practical for real-world data.
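As a loose sketch of that retrofit pattern, assuming a frozen pre-trained component plus a small trainable module that injects audio conditioning: the class names, dimensions, and the way the conditioning signal is added below are invented for illustration and are not the paper’s architecture.

```python
import torch
import torch.nn as nn

class AudioConditioningAdapter(nn.Module):
    """Trainable module mapping features of previously generated audio into a
    bias for the frozen estimator's hidden state. Dimensions are made up."""
    def __init__(self, audio_dim=128, hidden_dim=512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, audio_features):
        return self.proj(audio_features)

class RetrofittedFlowEstimator(nn.Module):
    """Frozen pre-trained network plus a trainable audio-conditioning adapter."""
    def __init__(self, base: nn.Module, adapter: AudioConditioningAdapter):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep the pre-trained weights fixed
        self.adapter = adapter       # only this part is trained

    def forward(self, hidden, audio_features):
        # Inject the audio-conditioning signal into the base model's input.
        return self.base(hidden + self.adapter(audio_features))

# Toy usage with a stand-in "base" network.
model = RetrofittedFlowEstimator(nn.Linear(512, 512), AudioConditioningAdapter())
out = model(torch.randn(2, 512), torch.randn(2, 128))
print(out.shape)  # torch.Size([2, 512])
```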
Another practical piece is how they handle mixing. They propose a simple composition by summing the step-generated tracks and normalizing loudness. It’s a pragmatic starting point, not a final word on how to balance competing sound sources in every scene. The authors acknowledge that scene-by-scene tuning—perhaps with a learned mixing model—could yield even more natural results. The core contribution, though, is the controllable, modular generation process that keeps the tracks distinct and faithful to their prompts.
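Here is what that naive composition step could look like for equal-length mono tracks: sum them, bring the result to a target loudness, and guard against clipping. The RMS-based normalization and the target level are assumptions; a production pipeline would use a proper loudness standard and scene-aware balancing, as the authors suggest.

```python
import numpy as np

def mix_tracks(tracks, target_rms=0.1):
    """Naively compose step-generated tracks: sum, then normalize loudness.

    tracks     : list of equal-length 1-D float arrays
    target_rms : illustrative loudness target, not a broadcast standard
    """
    mix = np.sum(np.stack(tracks), axis=0)
    rms = np.sqrt(np.mean(mix ** 2)) + 1e-12
    mix = mix * (target_rms / rms)   # bring the mix to the target loudness
    peak = np.max(np.abs(mix))
    if peak > 1.0:                   # avoid clipping after the gain change
        mix = mix / peak
    return mix

# Example with three tracks of two seconds of toy audio at 16 kHz.
rng = np.random.default_rng(0)
tracks = [0.05 * rng.standard_normal(32000) for _ in range(3)]
print(mix_tracks(tracks).shape)  # (32000,)
```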
Why this could reshape media and imagination
There’s something deeply satisfying about turning Foley into a software-enabled, stepwise craft. The proposed method makes it feasible to generate multiple, semantically distinct audio tracks for a single video. You can imagine a filmmaker or game-sound designer asking the AI to conjure a “foreground” sound like moose snorts and a separate “background” layer such as wind through pines or distant river rush. When the tracks are then mixed, the final soundscape can feel more layered, more deliberate, and more editable—without having to start from scratch every time a single element is off.
That modularity isn’t just a labor saver; it changes what counts as a deliverable in audio post-production. Right now, the process often ends in a single, monolithic track that must satisfy a lot of constraints at once. With step-by-step generation, a creator can dial in a sound event, listen, adjust the text prompts, and push the AI to refine only the piece that needs work. It’s closer to the Foley room’s iterative rhythm: propose, test, tweak, repeat. The promise is audibly richer soundscapes that stay in sync with action and emotion, even in fast-moving scenes or complex environments.
The work also hints at broader implications for accessibility and creativity. If machines can compose plausible soundtracks from simple prompts, this kind of tool could democratize high-quality sound design, letting indie filmmakers and small creative teams level up their audio without big crews or big budgets. And as the approach improves, it could empower interactive media—VR films, immersive storytelling, or live performances—where sounds need to adapt on the fly to changes in action or audience feedback.
Of course, there are caveats. The paper openly discusses limitations: some prompts still yield hums, noise, or artifacts, and alignment with text prompts can drift. The authors are frank about the fact that the quality and trustworthiness of the audio ultimately hinge on the base model’s capabilities and the data it was trained on. Still, the step-by-step, negation-guided approach offers a clear path to more controllable and modular sound design, rather than a black-box, end-to-end generator.
Why this matters beyond the cinema
Beyond movies and TV, the technique could influence any realm that blends vision and sound. Think of educational content where videos come with tailored audio cues to aid learning, or accessibility tools that rebuild soundscapes for people with different hearing profiles. In games and virtual reality, designers could layer live, responsive sounds that react to the player’s actions, all guided by natural language prompts. The idea of “sound events as modular building blocks” becomes a practical workflow, potentially accelerating iteration and expanding creative possibility.
But the story isn’t only about speed and convenience. If machines can learn to separate, reweight, and reimagine sound events with textual guidance, we begin to inhabit a new frontier where human intention and machine inference meet more closely. The AI isn’t just generating noise; it’s assembling a sonic narrative that can be tuned, audited, and revised in human time. And because this work leans on publicly accessible datasets rather than bespoke multi-track data, it lowers the barrier to experimentation—an invitation to artists and researchers alike to push the boundaries of what “matched” audio can feel like.
What comes next for sound and AI
The Sony AI group frames this as a first step toward more capable, modular video-to-audio systems. They show that you can train a flow estimator to handle audio conditioning without rebuilding the whole model from scratch. The next moves, they suggest, include refining how tracks are mixed, perhaps with an explicit perceptual model to balance loudness and timbre in a scene-aware way. They also point to expanding the approach to handle more than two or three sound events, enabling a fuller, more faithful sonic tapestry.
There’s also a call to explore richer training data and evaluation methods. The team built a new dataset, Multi-Caps VGGSound, to evaluate how well a system can generate a set of distinct audio tracks for a single video. As with many AI audio projects, the realism and usefulness of the results will improve as the community builds better benchmarks, more diverse sound inventories, and nuanced prompts that capture the subtleties of real-world acoustics. The potential is enormous, but the path requires careful attention to data quality, artistic intent, and the ethics of synthetic sound.
In the end, the work embodies a broader trend: the shift from monolithic AI outputs to modular, controllable, human-guided generation. If you squint, it looks like the audio version of a cinematic “workflow” that video editors and Foley artists have cultivated for decades. The difference is that the tools live in a model you can tweak with text, video context, and the sounds you’ve already created. It’s not the final scene, but it’s a compelling new act in the long-running drama between humans and machines in the art of making sound feel true to life.
Institution and authors: The study comes from Sony AI, the research arm of Sony Group Corporation, with Akio Hayakawa, Masato Ishii, Takashi Shibuya, and Yuki Mitsufuji among the principal contributors. The team emphasizes that the approach relies on pre-trained video-to-audio models and a training framework that works with accessible data, aiming to keep the door open for real-world creative workflows.
Bottom line: Step-by-step video-to-audio generation with Negative Audio Guidance nudges audio design toward a future where sounds are built, tested, and tuned track by track—yet still harmonized with the moving image and the creator’s intent. It’s not a finished symphony yet, but it is a well-timed, audibly true chorus that could change how we think about sound in the age of intelligent machines.