In the growing chorus of artificial intelligence systems that can describe a photo, translate a caption, or answer a riddle about a chart, a stubborn question keeps echoing: can these systems really combine multiple skills at once, or do they stumble when a task demands several abilities at the same time? It’s a bit like asking a chef to sear a steak and bake bread perfectly for a single meal. The closer the challenge gets to real life—recognizing objects, counting them, judging spatial relationships, and drawing conclusions—the more the limits of current vision-language models become apparent. The real tension is not just about teaching a model to see more things, but about threading together several distinct skills so that the whole is more capable than the sum of its parts.
The Princeton–Meta collaboration behind COMPACT (COMPositional Atomic-to-complex Visual Capability Tuning) confronts this very challenge. The team asked a more nuanced question: could we curate a training curriculum that teaches models to combine atomic visual capabilities—the smallest building blocks of perception—into richer, multi-step abilities? If so, could a model learn to answer questions that require counting and color recognition alongside spatial understanding, all in one go? The answer, at least in their experiments, is a resounding yes. The work comes out of Princeton University, with contributions from Meta AI, and is led by Xindi Wu and Hee Seung Hwang, supported by Polina Kirichenko and Olga Russakovsky.
What follows is not a journey through every technical twist, but a guided walk through the core idea: train the model not by cramming more of the same kind of questions into it, but by deliberately shaping the kinds of questions it sees. The goal is to help the model build a layered intuition for how different visual capabilities fit together, like a musician learning to harmonize different skills into a single piece of music. The payoff, the authors argue, is both practical and profound: you get strong performance on complex, four-capability tasks without exploding the data budget, and you boost the model’s ability to generalize to even tougher challenges it hasn’t seen before.
The Compositional Challenge
Modern multimodal language models tend to excel at straightforward vision–language tasks. They can name objects, describe colors, or answer simple questions about a scene. But when the questions require juggling multiple skills—counting objects while comparing colors and evaluating spatial relationships—the models falter. This is not just a curiosity; it mirrors real-world tasks where a user asks, for instance, how many red cars are in a scene and where they are relative to a blue building, all while noting the scene’s overall context. The problem, the paper argues, is that most Visual Instruction Tuning (VIT) datasets are dominated by simple queries that require only one capability. Performance falls off sharply as compositional complexity rises, a drop some researchers have described as a cliff: ask the model to exercise two or more capabilities at once, and accuracy can plummet.
In other words, the difficulty isn’t just about scale. It’s about how the model experiences the task space during training. If the training data rarely asks the model to combine capabilities, it won’t learn how to do so on its own, even if the architecture has the capacity to reason in more sophisticated ways. This is the core puzzle COMPACT seeks to address: can we design the data itself to cultivate compositional reasoning, rather than relying on sheer data volume to push models toward it?
To put it in human terms, imagine teaching someone to drive by repeatedly guiding them through short, isolated drills—stop signs, parallel parking, or highway merging—without ever presenting a situation that requires them to combine these skills on a busy street. You’d want a curriculum that forces the learner to blend the pieces, in context, until the whole operation becomes fluent. COMPACT proposes a similar pedagogy for machines: build from atomic, simple capabilities up to composite tasks, and do so with a careful distribution that ensures higher-complexity scenarios are not just occasional afterthoughts but regular training fare.
From Atomic Skills to Complex Reasoning
The heart of COMPACT is a four-step data recipe that transforms a handful of foundational visual skills into a ladder of increasing complexity. First, the team defines 10 atomic visual capabilities, grouped into Attribution (color and shape), Recognition (object recognition, action recognition, text recognition, spatial recognition, counting), and Relation (spatial relationship, object interaction, scene understanding). These are the elemental moves a model must master to interpret a scene. A model trained to recognize a red car, a blue chair, and a sign with legible text has the building blocks; what COMPACT wants is the choreography that puts them together to answer richer questions.
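To make the taxonomy concrete, here is a minimal sketch of how those 10 atomic capabilities could be organized in code. The grouping mirrors the description above; the identifier names are illustrative assumptions, not the authors’ actual implementation.

```python
# Illustrative organization of the 10 atomic visual capabilities named above,
# grouped into the three families (Attribution, Recognition, Relation).
# Identifier names are assumptions for illustration, not the authors' code.
ATOMIC_CAPABILITIES = {
    "attribution": ["color", "shape"],
    "recognition": [
        "object_recognition",
        "action_recognition",
        "text_recognition",
        "spatial_recognition",
        "counting",
    ],
    "relation": [
        "spatial_relationship",
        "object_interaction",
        "scene_understanding",
    ],
}

# Flattened list of all 10 atomic capabilities, handy for sampling later.
ALL_CAPABILITIES = [cap for group in ATOMIC_CAPABILITIES.values() for cap in group]
assert len(ALL_CAPABILITIES) == 10
```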
Second, they sample k atomic capabilities for a given training example, with k ranging from 1 to 3. The dataset is deliberately balanced across these complexity levels, rather than skewed toward the simplest tasks. This matters: balanced exposure to k = 1, 2, and 3 ensures the model encounters training signals that require combining capabilities at multiple levels, not just one, and learns to integrate them in meaningful ways.
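A balanced sampler over complexity levels could look like the sketch below, reusing the flat capability list from the previous snippet. Drawing k uniformly from {1, 2, 3} is one simple way to keep the three complexity levels in roughly equal proportion; it is a sketch of the idea, not the authors’ pipeline.

```python
import random

# Flat list of the 10 atomic capabilities (repeated from the previous sketch
# so this snippet runs on its own).
ALL_CAPABILITIES = [
    "color", "shape", "object_recognition", "action_recognition",
    "text_recognition", "spatial_recognition", "counting",
    "spatial_relationship", "object_interaction", "scene_understanding",
]

def sample_capability_combination(max_k: int = 3) -> list[str]:
    """Pick a complexity level k uniformly from 1..max_k, then draw k distinct
    atomic capabilities that a single generated question must combine."""
    k = random.randint(1, max_k)       # balanced exposure to k = 1, 2, 3
    return random.sample(ALL_CAPABILITIES, k)

# Example: a few balanced draws.
for _ in range(3):
    print(sample_capability_combination())
```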
Third, a conversational generator (powered by a capable closed-source model) is prompted to craft questions that weave exactly the chosen set of capabilities into a single, natural prompt. The prompt constraints are precise: the question must require visible information from the image, answers should be concise, and the query must not rely on dangling connectors or disjointed pieces that could be solved without looking at the image. A JSON template records which capabilities are required for each question, keeping the data scaffold tight and auditable.
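A record in that JSON scaffold might look something like the following sketch. The field names, image path, and question text are hypothetical; only the idea of tagging each generated question with its required capabilities comes from the paper.

```python
import json

# Hypothetical training record; field names and content are illustrative only.
record = {
    "image": "example_image.jpg",
    "required_capabilities": ["counting", "color", "spatial_relationship"],
    "question": "How many red cups sit to the left of the laptop?",
    "answer": "Two.",
}

print(json.dumps(record, indent=2))
```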
Fourth, the team applies a verification pass to weed out ungrounded or ambiguous queries. They keep only questions that truly require the specified capabilities and that cannot be answered from the question text alone. The result is a curated set of two to three high-quality conversations per image, generated with up to 10 verification attempts, ensuring that the data is not just abundant but genuinely capability-driven.
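Putting generation and verification together, the retry loop might be sketched as below. The `generate_conversation` and `passes_verification` helpers are hypothetical stand-ins for the closed-source generator and the grounding checks; the target of three kept conversations per image and the cap of 10 attempts follow the numbers quoted above, under the assumption that the attempt budget applies per image.

```python
import random

def generate_conversation(image: str, capabilities: list[str]) -> dict:
    """Hypothetical stand-in for the model-based question generator."""
    return {"image": image, "required_capabilities": capabilities,
            "question": "placeholder question", "answer": "placeholder answer"}

def passes_verification(candidate: dict, capabilities: list[str]) -> bool:
    """Hypothetical stand-in for the grounding/ambiguity checks; accepts
    candidates at random here so the sketch runs end to end."""
    return random.random() > 0.3

def build_conversations(image: str, capabilities: list[str],
                        target_per_image: int = 3, max_attempts: int = 10) -> list[dict]:
    """Generate candidate Q&A pairs and keep only those that pass verification,
    stopping once enough survive or the attempt budget is spent."""
    kept, attempts = [], 0
    while len(kept) < target_per_image and attempts < max_attempts:
        attempts += 1
        candidate = generate_conversation(image, capabilities)
        if passes_verification(candidate, capabilities):
            kept.append(candidate)
    return kept

print(len(build_conversations("example_image.jpg", ["counting", "color"])))
```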
Finally, COMPACT stitches this compositional tuning data together with a small slice of the traditional LLaVA-665K VIT data. In effect, the model continues to learn instruction-following style from the standard VIT data, while the compositional tuning data teaches it how to combine capabilities in a controlled way. The mix is deliberately modest—about 5% of the original VIT data plus 32,000 compositional samples, for a total training budget just under 10% of the full VIT scale. This is where the art of data efficiency shines: you get comparable or better performance on many benchmarks without bloating the dataset.
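The back-of-the-envelope arithmetic behind that budget, assuming the 5% slice is drawn from the 665K-sample LLaVA set as described:

```python
# Rough arithmetic for the training mix described above.
VIT_FULL = 665_000                    # LLaVA-665K instruction-tuning samples
vit_slice = int(0.05 * VIT_FULL)      # ~33,250 samples kept from the VIT data
compact_samples = 32_000              # generated compositional tuning samples

total = vit_slice + compact_samples
print(total)                                             # 65,250 samples
print(f"{total / VIT_FULL:.1%} of the full VIT budget")  # ~9.8%, just under 10%
```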
In practice, this means you’re not just teaching the model to recognize more things; you’re teaching it to reason about how those things fit together. A model trained with COMPACT can, for example, answer a question that requires recognizing multiple objects, counting them, and understanding their spatial arrangement—all in one shot. The data recipe thus acts like a recipe card for making the kind of flexible, compositional reasoning that humans use when they interpret a complex scene.
What the Results Really Mean for AI
When the dust settles, COMPACT’s empirical results are striking. The researchers trained a LLaVA-v1.5-7B-LoRA model on their COMPACT data and then evaluated it across a suite of challenging benchmarks designed to test multi-capability visual understanding: MMStar, MM-Vet, SeedBench2Plus, TextVQA, InfoVQA, and others. Across the board, the COMPACT model held its own against a full-scale LLaVA-665K VIT baseline, even though it used only about 10% of the data budget. In some tasks—particularly those that demand four or more atomic capabilities—the gains were even more pronounced. On MMStar, for instance, performance on those tougher questions improved by roughly 83.3% relative to the standard VIT baseline, and on MM-Vet the relative improvement climbed to around 94.0%.
The upshot is not just that COMPACT works; it works efficiently. The authors report an average relative performance of around 100.18% across benchmarks when combining their 32K compositional samples with a small VIT seed. In other words, with a tenth of the data, the model can match or surpass the performance of the full VIT approach on several tasks, and it outperforms it on a number of complex, multi-capability challenges. The data-efficiency gains aren’t a side effect; they’re the core story: explicitly teaching compositionality in training data reshapes how well a model can generalize when real-world prompts demand more than one perceptual trick at a time.
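One way to read the 100.18% figure, assuming “relative performance” means the COMPACT-tuned model’s benchmark score divided by the full-VIT baseline’s score, averaged across benchmarks (the scores below are placeholders, not results from the paper):

```python
def average_relative_performance(compact_scores: dict[str, float],
                                 baseline_scores: dict[str, float]) -> float:
    """Average of per-benchmark score ratios, expressed as a percentage.
    A value above 100 means COMPACT matches or beats the baseline on average."""
    ratios = [compact_scores[b] / baseline_scores[b] * 100 for b in compact_scores]
    return sum(ratios) / len(ratios)

# Placeholder scores purely for illustration; see the paper for real numbers.
compact = {"benchmark_a": 36.0, "benchmark_b": 31.5}
baseline = {"benchmark_a": 34.0, "benchmark_b": 33.0}
print(f"{average_relative_performance(compact, baseline):.2f}%")
```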
What about growth beyond what COMPACT was trained on? The paper dives into that, showing the model’s ability to generalize to higher compositional complexity (k = 4 or 5) even though those levels weren’t explicitly present in the training data. On MM-Vet, COMPACT reaches 57.5 with k = 4, compared with 32.5 for the baseline, and on MMStar, it hits 64.7 for k = 4, compared with 35.3 for the baseline. The takeaway is not just stable performance at known difficulties, but robust compositional generalization to unforeseen, harder tasks—a kind of conceptual elasticity in the model’s reasoning that’s hard to achieve through raw scale alone.
Beyond the numbers, the study offers a provocative view of how to think about model training. Instead of assuming that more data will automatically yield smarter models, COMPACT suggests a principled way to structure the data so that the model learns the architecture of thinking itself. It’s a shift from “bigger is better” to “smarter scaffolding.” When combined with the right balance of instruction tuning data, this approach yields a recipe for training that is not only powerful but scalable to more ambitious tasks in the future.
Limitations and a Map for the Next Frontier
No scientific work is a flawless blueprint, and COMPACT is no exception. The authors acknowledge two important limitations. First, much of the compositional data is generated using a closed-source model (Gemini). That choice brings advantages in speed and expressiveness, but it also embeds the biases and blind spots of the generator into the training data. The authors plan to release the generated data publicly to support reproducibility, but the underlying generation process remains a potential bottleneck for those who want to reproduce the exact pipeline with open tools. Second, COMPACT focuses on visual-centric compositionality. Tasks that hinge on knowledge or math or non-visual reasoning may require additional strategies that extend beyond the atomic-to-complex framework described here.
Looking ahead, the paper suggests several avenues for extension. Pushing the compositional complexity beyond k = 3 would require higher-fidelity data generation or hybrid pipelines that blend multiple data sources. There’s also room to explore step-by-step reasoning or explicit decomposition as a way to further strengthen multi-capability reasoning without sacrificing data efficiency. And as with many cutting-edge AI projects, the path to real-world deployment will hinge on careful attention to data provenance, fairness, and robustness across diverse visual contexts.
In the end, COMPACT offers a compelling argument that the next leap in multimodal AI might come not from bigger models alone but from smarter training curricula. By engineering how models see, count, relate, and reason about scenes, the researchers show that you can coax richer behavior from the same architectural backbone. It’s a reminder that, in AI as in life, how you teach the system matters just as much as what you teach it.
Lead institutions: Princeton University and Meta AI. Lead researchers: Xindi Wu and Hee Seung Hwang, with Polina Kirichenko and Olga Russakovsky. This work argues for a future where data curation itself becomes a form of architectural design, steering models toward flexible, multi-capability intelligence that feels less like memorization and more like genuine reasoning.