Diffusion-Inpainted Wardrobe Delivers Realistic, Consistent Video Try-On

Online shopping promises the dream of trying on clothes without a fitting room, but fabric is a moving target. A garment on a hanger looks nothing like it does draped around a body in motion, and static product photos can’t capture the sway, creases, or shimmer that come alive as you walk, spin, or sit down. For years, researchers tried to bridge that gap by stitching still images together into video or layering temporal filters on top of frame-by-frame edits. The result often felt like a painting that weakly hinted at animation: pretty, but not convincing.

The study behind ViTI, a project from Ant Group’s AI Research division, charts a different route. Instead of starting from a static image and tacking on time, the team treats video try-on as a conditional video inpainting problem. In plain terms: fill in the missing garment pixels across a moving person, so the garment stays faithful to its real-world texture and details while moving smoothly from frame to frame. Lead authors Cheng Zou and Senlin Cheng spearhead the effort, joined by colleagues Bolei Xu, Dandan Zheng, Xiaobo Li, Jingdong Chen, and Ming Yang.

From Frames to Fabrics: What ViTI Is

ViTI reframes video virtual try-on as a garment-inpainting task conditioned on prompts and reference garments. The core trick is to generate the missing garment pixels not by warping a garment image onto a body, but by letting a video diffusion model paint them in, guided by a description and a reference garment. The model uses a diffusion transformer with full 3D spatial-temporal attention, meaning it considers all frames together rather than treating each frame in isolation. That single design choice matters: when you ask a system to keep a shirt’s peplum consistent as the wearer turns, one eye on the whole video helps the pixels stay stitched to the body’s motion rather than slipping or ghosting between frames.
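
For readers who think in code, here is a minimal sketch of what full 3D spatial-temporal attention means in practice: the video latent is flattened into one token sequence so every patch can attend to every patch in every frame. The module names and shapes below are illustrative assumptions, not the authors’ implementation.

```python
# Minimal sketch of full 3D spatial-temporal attention (illustrative, not the ViTI code).
# Assumption: latents arrive as patch tokens of shape (batch, T, H, W, C).
import torch
import torch.nn as nn

class SpatioTemporalSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # One attention over *all* frames jointly, instead of one attention per frame.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, h, w, c = x.shape
        tokens = x.reshape(b, t * h * w, c)          # flatten time and space into one sequence
        out, _ = self.attn(tokens, tokens, tokens)   # every patch attends to every frame
        return out.reshape(b, t, h, w, c)

x = torch.randn(2, 8, 16, 16, 64)                    # 8 latent frames of 16x16 patches
y = SpatioTemporalSelfAttention(dim=64)(x)
print(y.shape)                                       # torch.Size([2, 8, 16, 16, 64])
```

A per-frame model would instead reshape to (batch * T, H * W, C) before attending, which is exactly where flicker between frames creeps in.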

To train something so ambitious, the team constructed a large, purpose-built data pipeline. They assembled a human-centric dataset called VTP, containing 51,278 video clips of people in various outfits and poses. This data fuels progressive training, starting from broad video inpainting and gradually specializing in garment painting. Along the way, they designed a suite of masking strategies: ways of erasing parts of the garment to teach the model to fill them in convincingly. They also introduced a temporal-consistency loss that penalizes differences between consecutive latent frames, nudging the model toward stable motion rather than flickering frames.
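
That temporal-consistency idea can be sketched as a penalty on frame-to-frame changes in the latents. The exact formulation and weighting in ViTI may differ; treat the version below as an assumption-laden illustration.

```python
# Sketch of a temporal-consistency penalty between consecutive latent frames.
# Assumption: latents have shape (batch, T, C, H, W); ViTI's actual loss may be formulated differently.
import torch

def temporal_consistency_loss(pred_latents: torch.Tensor, target_latents: torch.Tensor) -> torch.Tensor:
    # Frame-to-frame differences of the prediction should match those of the target,
    # discouraging flicker that a purely per-frame reconstruction loss would not catch.
    # (A simpler variant penalizes the predicted differences directly.)
    pred_delta = pred_latents[:, 1:] - pred_latents[:, :-1]
    target_delta = target_latents[:, 1:] - target_latents[:, :-1]
    return torch.mean((pred_delta - target_delta) ** 2)

pred = torch.randn(2, 8, 4, 32, 32)      # predicted latents: (batch, frames, channels, H, W)
target = torch.randn(2, 8, 4, 32, 32)
print(temporal_consistency_loss(pred, target))
```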

The project rests on what you might call a design principle: video generation first, garment painting second. The architecture relies on a diffusion transformer (DiT) that operates in a latent space compressed by a video VAE, which makes the heavy math tractable. The result is a system that not only preserves garment textures but also respects how folds change as the body moves, a crucial factor for realism in a moving image. In short, ViTI isn’t gluing a static image to a video; it’s teaching a paintbrush to follow a moving outline with care and fidelity.
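
As a rough mental model of that latent-space pipeline, here is a toy inpainting loop with stand-in components. None of the names, shapes, or update rules below come from the paper; they only illustrate the idea of repainting a masked region of a video latent while leaving the rest anchored to the source.

```python
# Toy sketch of "video generation first, garment painting second": denoise a masked region
# of a video latent with a stand-in DiT, then decode with a stand-in video VAE.
# Everything here is illustrative, not the ViTI API.
import torch

B, T, C, H, W = 1, 8, 4, 32, 32                       # latent video: 8 frames of 4x32x32
dit = lambda z, step, cond: torch.zeros_like(z)       # stand-in denoiser predicting noise
vae_decode = lambda z: z.repeat_interleave(8, -1).repeat_interleave(8, -2)[:, :, :3]  # stand-in upsampler

z_clean = torch.randn(B, T, C, H, W)                  # latent of the original video
mask = torch.zeros(B, T, 1, H, W)
mask[..., 8:24, 8:24] = 1.0                           # garment region to repaint
z = torch.randn_like(z_clean)                         # start the masked region from pure noise

for step in reversed(range(30)):                      # simplified denoising loop
    cond = {"garment": None, "pose": None, "text": "red checked shirt"}
    noise_pred = dit(z, step, cond)
    z = z - 0.1 * noise_pred                          # toy update; real samplers are more involved
    z = mask * z + (1 - mask) * z_clean               # keep unmasked pixels anchored to the source video

frames = vae_decode(z)                                # back to pixel space
print(frames.shape)                                   # torch.Size([1, 8, 3, 256, 256])
```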

How ViTI Works Under the Hood

At the mathematical level, the model learns to undo noise: given a noised version of the video latent at the current denoising step, plus a set of conditions, it predicts what the clean latent should look like, one denoising step at a time. The conditioning can be a text prompt, but for ViTI the real magic lies in geometry-aware conditioning: a garment encoder and a pose encoder feed the diffusion process. The garment encoder blends a latent image representation of the target garment with features from a robust vision backbone (DINOv2), producing an embedding that tells the model what the fabric, color, and texture should look like. The pose encoder, built from DensePose features, supplies a spatial prior that anchors the garment to the wearer’s body shape and depth, helping folds align with limbs and torso as they move.
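
A hedged sketch of what those two encoders might look like follows. Only the ingredients come from the paper’s description (a garment latent, DINOv2-style features, DensePose-style maps); the dimensions, fusion scheme, and module names are assumptions for illustration.

```python
# Illustrative sketch of garment and pose conditioning (assumed shapes and fusion;
# ViTI's actual encoders are defined in the paper).
import torch
import torch.nn as nn

class GarmentEncoder(nn.Module):
    """Fuse a latent image of the garment with a semantic feature vector (e.g. from DINOv2)."""
    def __init__(self, latent_dim=4, feat_dim=768, out_dim=512):
        super().__init__()
        self.latent_proj = nn.Conv2d(latent_dim, out_dim, kernel_size=1)
        self.feat_proj = nn.Linear(feat_dim, out_dim)

    def forward(self, garment_latent, garment_feat):
        # garment_latent: (B, 4, h, w) from an image VAE; garment_feat: (B, 768) from DINOv2
        tokens = self.latent_proj(garment_latent).flatten(2).transpose(1, 2)   # (B, h*w, out_dim)
        semantic = self.feat_proj(garment_feat).unsqueeze(1)                   # (B, 1, out_dim)
        return torch.cat([semantic, tokens], dim=1)                            # garment tokens for cross-attention

class PoseEncoder(nn.Module):
    """Embed DensePose-style body maps into a spatial prior aligned with the video latent."""
    def __init__(self, in_ch=3, out_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.SiLU(),
                                 nn.Conv2d(64, out_dim, 3, stride=2, padding=1))

    def forward(self, densepose_maps):
        # densepose_maps: (B*T, 3, H, W) rendered body-surface coordinates, one map per frame
        return self.net(densepose_maps)                                        # (B*T, out_dim, H/4, W/4)

g = GarmentEncoder()(torch.randn(1, 4, 16, 16), torch.randn(1, 768))
p = PoseEncoder()(torch.randn(8, 3, 64, 64))
print(g.shape, p.shape)    # torch.Size([1, 257, 512]) torch.Size([8, 512, 16, 16])
```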

These embeddings are injected into the diffusion transformer through cross-attention layers, essentially letting the garment and pose information tilt the generation toward the desired result. It’s a shared attention mechanism, but the garment embedding gets its own stream so texture and pattern can survive as the person moves across the frame. The model is built to be guided by both text and visual cues, so a designer can say “red checks” or drop in a garment image to specify the look, while the body remains faithful to the captured motion.
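
Here is a stripped-down transformer block showing one plausible wiring for that injection. For brevity this sketch concatenates all condition tokens into a single cross-attention stream, whereas the paper describes a dedicated stream for the garment embedding; the block structure is an assumption, not the published architecture.

```python
# Sketch of a DiT block with cross-attention to garment/pose/text tokens (illustrative wiring).
import torch
import torch.nn as nn

class DiTBlockWithConditioning(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)   # 3D spatio-temporal self-attention
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # attends to condition tokens
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, video_tokens, cond_tokens):
        # video_tokens: (B, T*H*W, dim) flattened video latent patches
        # cond_tokens:  (B, N, dim) garment/pose/text embeddings concatenated along N
        x = video_tokens
        x = x + self.self_attn(self.norm1(x), self.norm1(x), self.norm1(x))[0]
        x = x + self.cross_attn(self.norm2(x), cond_tokens, cond_tokens)[0]    # conditions steer the generation
        return x + self.mlp(self.norm3(x))

block = DiTBlockWithConditioning()
video_tokens = torch.randn(1, 8 * 16 * 16, 512)   # 8 frames of 16x16 latent patches
cond_tokens = torch.randn(1, 260, 512)            # garment + pose + text embeddings
print(block(video_tokens, cond_tokens).shape)     # torch.Size([1, 2048, 512])
```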

Training isn’t a single leap; it’s a three-act process. In Stage 1, the model learns general video inpainting from a broad video dataset with time-invariant and time-variant mask strategies. Stage 2 adds more structured supervision with instance-level masks from video object segmentation datasets, helping the model understand object boundaries in motion. Stage 3 finally trains on a specialized, clothes-focused dataset (the VTP collection), where garment masks and garment prompts align more tightly with the target region. This staged approach builds capability more robustly than trying to teach the model garment painting in one go.
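
Laid out as a configuration sketch (datasets and mask types follow the description above; everything else, including any hyperparameters, is omitted or purely illustrative):

```python
# Progressive training schedule as described in the article; no hyperparameters are implied here.
STAGES = [
    {"stage": 1, "data": "broad video corpus",
     "masks": ["time-invariant", "time-variant"],
     "goal": "general video inpainting"},
    {"stage": 2, "data": "video object segmentation datasets",
     "masks": ["instance-level"],
     "goal": "object-aware inpainting in motion"},
    {"stage": 3, "data": "VTP (51,278 clips)",
     "masks": ["garment"],
     "goal": "garment painting guided by garment prompts"},
]

for cfg in STAGES:
    print(f"Stage {cfg['stage']}: train on {cfg['data']} with {cfg['masks']} masks -> {cfg['goal']}")
```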

One more technical flourish matters: the masking strategies themselves. Time-invariant masks erase a fixed region across all frames; time-variant masks let the erased region wander, mimicking how clothing might move in real life. Instance masks and garment masks provide tighter, frame-by-frame guidance on the garment region. The result is a model that can handle varied occlusions, poses, and clothing types, from tops to skirts to pants, without losing texture fidelity or motion coherence.
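
The two generic mask families are easy to sketch; the generation procedure below is illustrative rather than the paper’s exact recipe.

```python
# Illustrative time-invariant vs. time-variant masks for a (T, H, W) clip (not the paper's exact recipe).
import torch

def time_invariant_mask(T, H, W, box=(16, 16, 48, 48)):
    """Erase the same rectangle in every frame."""
    y0, x0, y1, x1 = box
    mask = torch.zeros(T, H, W)
    mask[:, y0:y1, x0:x1] = 1.0
    return mask

def time_variant_mask(T, H, W, size=32, max_step=4):
    """Let the erased rectangle drift from frame to frame, as moving clothing would."""
    mask = torch.zeros(T, H, W)
    y, x = (H - size) // 2, (W - size) // 2
    for t in range(T):
        mask[t, y:y + size, x:x + size] = 1.0
        y = int(torch.clamp(y + torch.randint(-max_step, max_step + 1, (1,)), 0, H - size))
        x = int(torch.clamp(x + torch.randint(-max_step, max_step + 1, (1,)), 0, W - size))
    return mask

print(time_invariant_mask(8, 64, 64).sum(dim=(1, 2)))   # same erased region in every frame
print(time_variant_mask(8, 64, 64).sum(dim=(1, 2)))     # same erased area, shifting position per frame
```

Instance masks and garment masks would come from segmentation rather than random rectangles, which is what the later training stages supply.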

Why It Matters and What It Could Change

For shoppers, ViTI promises more believable online try-ons: you see a garment not just on a still image but draping and moving with your own silhouette as you walk, stride, or bend. For retailers, it could trim returns and refunds by providing a clearer sense of how a garment behaves in real life. And for designers, the technology acts like a live studio: you can visualize how textiles respond to movement without building a full sample, speeding up iterations and broadening experimentation with color, texture, or cut.

But the implications go beyond pretty pictures. In the age of deepfakes and synthetic media, a tool that can convincingly render clothing on a moving person raises questions about authenticity, provenance, and consent. The authors and their team are careful to couch their work as an enabling technology—one that could empower consumers and designers while demanding thoughtful governance and safeguards against misuse. The VTP dataset, with thousands of clips, also hints at the scale of data needed to train such systems responsibly, underscoring why transparency and data stewardship matter as much as the algorithms themselves.

From a broader tech-ecosystem perspective, ViTI illustrates a trend: moving from image-centric workflows to video-centric generation that respects temporal coherence. The combination of full 3D attention and cross-modal conditioning could ripple into other domains—virtual fashion shows, AR dressing rooms, or even film production where synthetic garments need to move with actors in fluid, high-detail ways. The practical upshot is not just a smarter wardrobe; it’s a new way to choreograph movement, texture, and light across time and space.

As the field marches forward, several questions will shape how ViTI and its kin are adopted. How close to real time can these models operate on consumer hardware? How will we calibrate and audit garment realism across different fabrics, colors, and cultural styles? And who gets to own and curate the data that teaches these systems to dress people? The paper’s authors point to a responsible path forward: high-quality data, explicit temporal regularization, and a design that foregrounds user control and consent. The rest will unfold as fashion, policy, and AI research twist together in real time.