A Hidden Code Keeps AI Characters Consistent

When you flip through a sequence of AI-generated images, you expect the same character to show up again and again—the same face, the same hat, the same mood. Yet in practice, even the best image engines drift. A character who appears in one frame as a confident, sunlit portrait can look subtly different in the next, and before long you’ve got a jumble of variants rather than a coherent cast. That inconsistency is not just a nuisance for storytellers; it’s a fundamental ceiling on how far generative AI can go in long-form content, whether you’re crafting a graphic novel, a storyboard, or an evolving visual world in a game or film project.

Enter a new approach from researchers at the University of California, Merced and Google DeepMind, led by Hsin-Ying Lee, with Ming-Hsuan Yang and Kelvin C.K. Chan among the authors. Their method—called Contrastive Concept Instantiation, or CoCoIns—proposes a way to keep a subject the same across hundreds or even thousands of independent generations without the heavy lifting of custom tuning or ever-present references. In plain terms: CoCoIns gives each subject a tiny, hidden language code that the AI learns to follow, so you can summon the same person again and again without re-teaching the system every time.

That may sound like magic, but it’s a carefully built bridge between a model’s latent imagination and a user’s consistent intent. The paper’s core idea is to model instances of a concept—the specific faces, outfits, or objects you care about—as distinct, learnable patterns inside the model’s latent space. A mapping network translates a random latent code into a pseudo-word that acts like a tag for a particular instance. When you embed that pseudo-word into the prompt, the image generator uses it to render the same subject in new scenes, across independent creations. The result is a system that delivers both stability (the same subject stays recognizable from one generation to the next) and diversity (different subjects or poses still feel richly varied).

To place this in context: the team’s work builds on a long line of subject-driven generation methods that either tune the model for each subject or rely on extra encoders to digest reference images. CoCoIns promises to sidestep those costs entirely. It’s a collaboration that sits at the intersection of academia and industry, anchored in UC Merced and Google DeepMind, and it centers on a practical, scalable problem that keeps surfacing as AI-generated visuals move from single images to narratives and episodes.

Subject consistency without tuning or references

The problem is surprisingly stubborn. If you tell a diffusion model to generate a portrait of “a man on a soccer field” or “a woman in a fantasy robe,” the system can produce stunning, varied outputs. But the moment you try to keep that same man or woman across several frames, the model tends to reinterpret facial features, body shape, or even lighting in ways that feel like a new subject every time. The traditional workarounds are heavy: you either fine-tune the model on every single subject, or you stitch together outputs from different runs and manually align them. Neither approach scales for a writer, illustrator, or studio trying to produce long-form content.

CoCoIns reframes the challenge as a problem of association in a latent space. Instead of teaching the model each time to “remember” a subject, the researchers train a lightweight mapping network to convert a random latent code z into a pseudo-word w. This pseudo-word then slots into a specific location in the prompt, where a subject descriptor lives (for example, a placeholder like [man]). The magic happens because the network learns to link z to a particular instantiation of the concept—an instance that stays visually consistent across generations as long as you reuse the same z.
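To make the mechanics concrete, here is a minimal sketch of what such a mapping network might look like in PyTorch. The class name, layer sizes, and the 768-dimensional embedding are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Hypothetical mapper from a random latent code z to a pseudo-word embedding w."""
    def __init__(self, latent_dim: int = 64, embed_dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 512),
            nn.SiLU(),
            nn.Linear(512, embed_dim),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # The output w is treated as the embedding of a pseudo-word.
        return self.net(z)

f = MappingNetwork()
z = torch.randn(1, 64)   # the private "key" for one subject instance
w = f(z)                 # reuse the same z and you get the same pseudo-word back
```

In this sketch the network is deterministic, so feeding it the same z always produces the same pseudo-word, which is what lets a single code stand in for a single subject.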

Crucially, this approach does not require curated references for every subject, nor does it ask the user to supply a new identity for each piece of content. The model learns the associations in a self-supervised way through a contrastive training scheme. In practical terms, the system creates triplets of prompts and latent codes, then nudges the outputs so that the anchor and a positive example (sharing the same latent code) stay close in appearance, while a negative example (with a different latent code) drifts away. The focus is deliberately placed on the subject region, not the background, so the background can remain flexible and varied while the subject stays recognizable.
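That contrastive idea can be pictured with a short sketch like the one below, which assumes you already have image features for the anchor, positive, and negative generations plus a mask over the subject region. The function name, margin value, and use of cosine distance are assumptions made for illustration, not the paper’s exact formulation.

```python
import torch
import torch.nn.functional as F

def masked_triplet_loss(anchor, positive, negative, subject_mask, margin: float = 0.2):
    """anchor/positive/negative: feature maps (B, C, H, W); subject_mask: (B, 1, H, W)."""
    def pool_subject(x):
        # Average features over the subject region only, leaving the background free to vary.
        return (x * subject_mask).sum(dim=(2, 3)) / subject_mask.sum(dim=(2, 3)).clamp(min=1e-6)

    a, p, n = pool_subject(anchor), pool_subject(positive), pool_subject(negative)
    d_pos = 1 - F.cosine_similarity(a, p)   # same latent code: pull together
    d_neg = 1 - F.cosine_similarity(a, n)   # different latent code: push apart
    return F.relu(d_pos - d_neg + margin).mean()
```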

The paper’s authors also distinguish their work from batch-based approaches that force all samples in a batch to look alike. CoCoIns operates on a per-creation basis, but it uses the learned pseudo-words to keep a consistent identity across different creations. It’s a bit like giving a character a unique stylistic signature that travels with the latent code, even as the surrounding scene changes dramatically.

From latent codes to consistent subjects

How does this actually feel when you’re in the director’s chair of an AI-assisted project? Think of a latent code as a tiny, private key. You don’t need to know what it looks like, just that reusing the same key yields a subject that “feels” the same across scenes. The mapping network f takes a code z from a simple Gaussian distribution and outputs a vector w, which is then treated as a pseudo-word. This word is inserted into the text embedding at a deliberate spot in the prompt, just before the target subject descriptor. In effect, you’re telling the image generator, in language-like form, “render this same person again, using this particular cue.”
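In code, the insertion step might look like the sketch below, which assumes token-level embeddings from a text encoder and a known position for the subject descriptor. Real pipelines typically keep a fixed sequence length, so treat this as a simplified illustration rather than the paper’s implementation.

```python
import torch

def insert_pseudo_word(prompt_embeds: torch.Tensor,
                       w: torch.Tensor,
                       subject_index: int) -> torch.Tensor:
    """Insert the pseudo-word embedding w just before the subject token.

    prompt_embeds: (seq_len, embed_dim) token embeddings, e.g. for "a photo of a man ..."
    w:             (embed_dim,) output of the mapping network for one latent code
    subject_index: position of the subject descriptor token (e.g. "man")
    """
    before = prompt_embeds[:subject_index]
    after = prompt_embeds[subject_index:]
    return torch.cat([before, w.unsqueeze(0), after], dim=0)
```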

The researchers don’t stop at the how-do-we-do-this; they also design a learning objective that makes the association robust. To train the system, they construct multiple variations of descriptions for the same image and couple them with different latent codes. The model then learns to connect, for instance, code z1 with pseudo-word w1 so that the anchor image and a closely related positive image land in similar visual space, while a different code z2 tied to w2 lands farther away. They also implement a background-preservation constraint so that changes stay focused on the subject, not the scenery. The result is a more faithful, controllable way to generate long-running sequences where characters stay recognizable even as the plot scatters the action across cities, seasons, or outfits.
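A rough picture of how one training triplet might be assembled is sketched below: two caption variations of the same image share a latent code, while a third variation is paired with a different code. The helper name and the use of a standard Gaussian for the codes are illustrative assumptions.

```python
import torch

def build_triplet(prompt_variants, latent_dim: int = 64):
    """prompt_variants: at least three caption strings describing the same image."""
    z1 = torch.randn(latent_dim)   # code shared by anchor and positive
    z2 = torch.randn(latent_dim)   # a different code for the negative
    anchor   = (prompt_variants[0], z1)
    positive = (prompt_variants[1], z1)   # same code, different wording: should match the anchor
    negative = (prompt_variants[2], z2)   # different code: should look like someone else
    return anchor, positive, negative
```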

In their experiments, the team trains the mapping network on a dataset of human faces (CelebA) and uses an off-the-shelf captioning model to generate prompts, along with a segmentation model to identify subject regions. They measure success not just by how point-for-point similar the faces are (a difficult metric in the wild) but also by how much variety remains across different identities and contexts. They compare their approach against several tuning-free and tuning-based baselines, finding that CoCoIns delivers competitive subject consistency while preserving much greater flexibility and diversity. In other words, you get the best of both worlds: a stable character and the freedom to tell new stories without constantly re-training the model.
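One simple way to quantify that balance, sketched below under the assumption that face embeddings for the generated images have already been computed with an off-the-shelf recognition model, is to compare similarity within a latent code against similarity across codes. The grouping and metric are illustrative, not the paper’s exact evaluation protocol.

```python
import torch
import torch.nn.functional as F

def consistency_and_diversity(embeddings_by_code):
    """embeddings_by_code: {code_id: (N, D) face embeddings}, with >= 2 codes and N >= 2."""
    within, across = [], []
    codes = list(embeddings_by_code)
    normed = {c: F.normalize(embeddings_by_code[c], dim=-1) for c in codes}
    for i, ci in enumerate(codes):
        sim = normed[ci] @ normed[ci].T
        n = sim.shape[0]
        within.append(sim[~torch.eye(n, dtype=torch.bool)].mean())   # same code: want high
        for cj in codes[i + 1:]:
            across.append((normed[ci] @ normed[cj].T).mean())        # different codes: want lower
    return torch.stack(within).mean(), torch.stack(across).mean()
```

A high within-code score paired with a clearly lower across-code score would indicate that the method keeps one subject stable without collapsing every generation onto the same face.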

Why this matters for creators and culture

The promise of CoCoIns goes beyond a clever engineering trick. It speaks to a broader truth about AI-generated media: the more we rely on machines to generate stories, the more we need reliable, scalable ways to manage characters across time. Writers and illustrators can conceive longer arcs, comics with recurring cast, or cinematic sequences where a hero appears in different contexts without losing their essential traits. It’s a practical form of memory, embedded in the model’s very fabric, that allows for iterative creation without sacrificing coherence.

From a craft perspective, the technology offers a democratizing lift. You don’t need a big budget to curate dozens of reference images for every character, nor do you need a team of engineers to re-tune a model for each subject. A single latent code can be reused to spawn an entire episode’s worth of imagery, each frame carrying the same identity while inviting fresh settings, emotions, and action. That’s a powerful enabler for independent creators, small studios, and first-time storytellers who want to experiment with long-form visuals without getting tangled in the wires of machine learning logistics.

Of course, with power comes responsibility. Techniques that preserve identity—even of fictional characters—raise questions about authenticity, misrepresentation, and the potential for misuse. The authors themselves note that their work focuses on subject consistency, not on replicating real people. Still, as with any tool that can push the boundaries of visual realism, the ethical landscape deserves attention. What counts as fair use in a serialized narrative? How do we prevent the technique from being weaponized to impersonate someone without consent? These are questions the community will need to wrestle with as such methods become more capable and accessible.

On the bright side, CoCoIns hints at a future where AI-driven storytelling feels more like directing with a reliable cast rather than assembling prop-by-prop with random face-matching. If you’ve ever watched a film or read a graphic novel where a hero’s appearance subtly shifts from page to page, you know how jarring that can be. A stable, controllable subject gives creators room to focus on mood, pacing, and world-building, safe in the knowledge that the cast will stay consistent as the plot unfolds.

What comes next for AI storytelling

The paper’s authors are frank about the path ahead. They’ve demonstrated single-subject consistency and shown promising early results for multi-subject scenarios and for general concepts beyond faces. The next hurdles include scaling to diverse categories beyond people, refining how well multiple characters can coexist in the same image, and further reducing any residual entanglement between subjects and backgrounds. There’s also interest in making the approach even more plug-and-play, so creators can drop in a new character with minimal setup and simple guidance about how that identity should look and behave across scenes.

One practical implication is that tools built on this approach could become standard features in story-boarding, animation pre-visualization, or game design pipelines. Imagine a writer sketching a scene where multiple recurring figures interact, then letting CoCoIns ensure that each figure appears consistently across dozens of variations and lighting conditions. The underlying ideas could also spill into other domains—animals, objects, or even fantastical concepts—once the system learns new associations between latent codes and concept instances. In short, CoCoIns offers a scaffold for a more memory-rich, flexible form of AI-generated content that keeps up with the tempo of storytelling.

The UC Merced and Google DeepMind team, led by Hsin-Ying Lee and supported by Ming-Hsuan Yang and Kelvin C.K. Chan, has laid down a clear, human-friendly path to more coherent AI imagery. It’s a reminder that the best leaps in AI storytelling don’t just push the ceiling higher; they also raise the floor so that more people can participate in crafting compelling visual narratives. If the last few years have felt like a rapid-fire sprint through new capabilities, CoCoIns offers a pause—an invitation to slow down, plan your characters, and let the machine help you carry them through an entire story with consistent personality and presence.

Lead researchers Hsin-Ying Lee, Ming-Hsuan Yang, and Kelvin C.K. Chan, from the University of California, Merced and Google DeepMind, spearheaded the work, with several co-authors contributing to the experimental design and analysis. The study advances a practical form of consistency in generative AI, one that could reshape how we think about characters, worlds, and long-form visuals in the age of diffusion models.