On a quiet afternoon, you walk into a living room and your brain instantly sketches a map: a chair tucked beside a coffee table, a plant basking in the corner, the TV mounted above a console. You can tell who’s where, who’s touching whom, and even how close one thing sits to another. That effortless spatial chatter is something artificial systems struggle to reproduce. COPA-SG, a project born at the University of Augsburg in Germany, aims to give machines a richer, more precise dialect for describing rooms and their relationships. The team behind COPA-SG—led by Julian Lorenz and colleagues—has built not just a bigger dataset, but a fundamentally different way of talking about space: one that adds numbers to words and even imagines hypothetical scenes.
Traditional scene graphs describe scenes as triples like (chair, next to, table). They’re useful for making AI explanations legible, but they’re often noisy, incomplete, and tethered to human judgments. If you’re a robot trying to navigate a real apartment, a binary label like “next to” misses the nuance you actually rely on: how close is “next to”? From which viewpoint is the chair behind the sofa? And what should the robot do if a new object—say, a lamp—were placed in a particular spot? COPA-SG tackles these questions head‑on by replacing fuzzy human labels with precise, machine‑readable annotations. The result is a dense, synthetic gallery of scenes where every possible relation is mapped, every angle is measured, and even what would happen if you added a new object is thought through ahead of time.
In a field crowded with clever ideas but sparse ground truth, COPA-SG arrives as a kind of scalpel: a dataset that’s exhaustive by design, not merely comprehensive by accident. It’s not just about having more data; it’s about having data with a dependable language to describe space—parametric relations that quantify distance and angle, and proto-relations that describe hypothetical, future possibilities. The scale is ambitious: more than 1,200 indoor scenes, thousands of views per scene, and a total of over 86 million relation annotations. And crucially, every scene is annotated with precise parameters, not just labels. This provides a sturdy foundation for training models that can actually reason about space like humans do—reasoning that can guide robots, assist in design, or help virtual assistants plan around a room’s real geometry.
A finer map of space for scenes
The heart of COPA-SG is the move from binary relationship labels to parametric relations. Imagine you’re labeling a scene and you want to say not just that one object is near another, but that they are 50 centimeters apart. Or you want to say that one object sits at a particular angle relative to another—say, the chair is 31 degrees to the left of the sofa from the viewer’s perspective. COPA-SG stores these kinds of details as parameters attached to each relation. The researchers define a compact vocabulary of predicates, including directional relations like in front of, behind, above, below, left, and right, plus distance-based relations like next to and touching. Each relation becomes a six-tuple: subject, object, predicate class, a numeric parameter, a camera perspective, and a test direction. In short, a relation is no longer a rough label; it’s a precise statement with spatial semantics baked in.
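To make that six-tuple concrete, here is a minimal sketch of how such a parametric relation might be represented in code. The field names, types, and the exact set of predicates are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple

class Predicate(Enum):
    # Directional and distance-based predicates mentioned in the article.
    IN_FRONT_OF = "in front of"
    BEHIND = "behind"
    ABOVE = "above"
    BELOW = "below"
    LEFT = "left"
    RIGHT = "right"
    NEXT_TO = "next to"
    TOUCHING = "touching"

@dataclass
class ParametricRelation:
    """One relation as a six-tuple: subject, object, predicate class,
    numeric parameter, camera perspective, and test direction."""
    subject_id: int                     # instance id of the subject object
    object_id: int                      # instance id of the reference object
    predicate: Predicate
    parameter: float                    # e.g. 50.0 (cm) for "next to", 31.0 (deg) for "left"
    camera_id: Optional[int] = None     # set for camera-dependent relations, None otherwise
    test_direction: Optional[Tuple[float, float, float]] = None  # direction used to test the relation

# Example: the chair is 31 degrees to the left of the sofa from camera 0's viewpoint.
rel = ParametricRelation(subject_id=12, object_id=7,
                         predicate=Predicate.LEFT, parameter=31.0,
                         camera_id=0)
```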
But space is not only about where things sit; it’s about how we view them. COPA-SG introduces directional relations that depend on the camera’s viewpoint, and it also accommodates camera‑independent relations that lean on the object’s own pose. The dataset uses a voxel-based method to measure distances with centimeter-level granularity, enabling exact measurements like “50 cm apart” rather than coarse generalities. It also uses a ray-sweep technique to determine front and back surfaces, constructing a reliable sense of orientation that’s robust to clutter and occlusion. The result is a richer, more human-like language for describing spatial layouts—one that can be used by downstream systems to plan, reason, and act with a sense of space that feels tangible rather than abstract.
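As a rough illustration of what a camera-dependent relation involves, the sketch below computes a signed horizontal angle between two object centroids as seen from a camera. It is a simplification built on assumed centroid positions and camera axes; COPA-SG's own voxel-based distance measurements and ray-sweep orientation tests are considerably more involved.

```python
import numpy as np

def viewer_relative_angle(subj_center, obj_center, cam_forward, cam_up):
    """Signed horizontal angle (degrees) of the subject relative to the object,
    as seen from a camera with the given forward and up axes.
    Under this convention a positive angle means the subject lies toward the
    camera's right, a negative angle toward its left."""
    right = np.cross(cam_forward, cam_up)      # camera's right axis
    d = np.asarray(subj_center) - np.asarray(obj_center)
    x = np.dot(d, right)                       # lateral offset in the camera frame
    z = np.dot(d, cam_forward)                 # depth offset in the camera frame
    return float(np.degrees(np.arctan2(x, z)))
```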
The second big upgrade is proto-relations. Rather than locking a scene into present-tense facts, proto-relations encode hypothetical relationships conditioned on the future presence of new objects. For example, a proto-relation might say “somewhere next to the TV,” or define a volume behind a sofa where a new object would satisfy a particular relation with the anchor. Proto-relations effectively sketch out a space of possibilities, a mental sandbox that an intelligent agent can consult when planning where to place something, or when simulating how a scene might evolve. It’s a semantic “constructive geometry” layer that invites future agents to reason about not just what is, but what could be.
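A proto-relation can be pictured as an anchor object, a predicate, and a volume of space in which a hypothetical new object would satisfy that predicate. The sketch below uses an axis-aligned box for the volume; the box representation, field names, and coordinates are assumptions made for illustration, not the dataset's actual encoding.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ProtoRelation:
    """A hypothetical relation: any new object placed inside the volume
    would satisfy `predicate` with respect to the anchor object."""
    anchor_id: int           # existing object the relation is anchored to (e.g. the TV)
    predicate: str           # e.g. "next to" or "behind"
    region_min: np.ndarray   # min corner of an axis-aligned volume (metres)
    region_max: np.ndarray   # max corner of the volume

    def would_satisfy(self, new_object_center) -> bool:
        """Check whether a candidate placement falls inside the volume."""
        c = np.asarray(new_object_center)
        return bool(np.all(c >= self.region_min) and np.all(c <= self.region_max))

# "Somewhere next to the TV": a half-metre-wide slab beside the TV cabinet.
proto = ProtoRelation(anchor_id=3, predicate="next to",
                      region_min=np.array([2.0, 0.0, 1.0]),
                      region_max=np.array([2.5, 0.9, 2.0]))
print(proto.would_satisfy([2.2, 0.4, 1.5]))  # True: this placement sits next to the TV
```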
To achieve all of this, COPA-SG relies on a robust synthetic pipeline. The dataset draws from Infinigen, a procedural world generator that can create diverse, photorealistic indoor scenes, and then applies a deterministic, automated annotation process. The scale is striking: 86 million relation annotations across 1,200 scenes, with a typical view producing tens of thousands of relations and a per-scene average of around 72 thousand relations. And unlike many benchmarks that cherry-pick salient relationships, COPA-SG aims for complete, exhaustive graphs of each scene. That completeness matters because it removes a key bottleneck in training and evaluation: if your ground truth is partial or biased toward obvious relationships, your model learns to see only a subset of the space. COPA-SG trains models to see the entire space and to measure just how well they understand it.
Even at the data-engineering level, COPA-SG is designed for practicality. The dataset provides segmentation masks, depth maps, and surface normals alongside every image, enabling a rich fusion of cues for learning. It’s not just about “seeing” 2D pixels; it’s about using 3D geometry to ground relations in real, measurable space. The authors also map Infinigen’s object names to a consistent set of classes so learners can reason about familiar object categories even when the synthetic generator invents new textures or shapes. In short, COPA-SG is built as a bridge: from vivid, 3D synthetic worlds to meaningful, shareable, and trainable graph structures that reflect how spaces actually work.
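A single training sample might then bundle the rendered image with its geometric side channels, roughly along these lines. The field layout is a guess at a convenient in-memory structure, not COPA-SG's on-disk format.

```python
from dataclasses import dataclass
from typing import Dict, List
import numpy as np

@dataclass
class ViewSample:
    """One rendered view with the side channels described in the article."""
    rgb: np.ndarray            # (H, W, 3) uint8 image
    depth: np.ndarray          # (H, W) float32 depth map in metres
    normals: np.ndarray        # (H, W, 3) float32 surface normals
    instance_mask: np.ndarray  # (H, W) int32 per-pixel instance ids
    class_of: Dict[int, str]   # instance id -> canonical class name (mapped from Infinigen)
    relations: List            # parametric relations annotated for this view
```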
From synthetic data to smarter reasoning
One of COPA-SG’s practical aims is to give scene-graph models a real chance to learn fine-grained, multi-relationship reasoning. Traditional models often struggle when asked to predict more than one or two relationships per object pair, or when precise distance or angle matters. COPA-SG’s parametric setup demands a new kind of prediction: for each subject-object pair, the model must decide not only whether a relation exists, but also what the numeric parameter should be. The paper adapts a two-stage model, DSFormer, to predict both existence and parameter values. During training, the model outputs a binary existence flag for each relation class and a numeric parameter for those that exist. The loss terms are carefully crafted: a binary cross-entropy loss for relation existence and specialized angle and distance losses that encourage accurate, physically meaningful values. It’s a rare case where the research both extends the architecture and defines the learning targets with mathematical care, all while staying faithful to the data’s semantic intent.
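The paper's exact loss formulations are not reproduced here, but a plausible combination, assuming an L1 distance term and a wrap-around-aware angle term alongside the binary cross-entropy, might look like the following sketch.

```python
import torch
import torch.nn.functional as F

def relation_loss(exist_logits, exist_gt,
                  angle_pred, angle_gt,
                  dist_pred, dist_gt,
                  mask_angle, mask_dist):
    """Combined loss for parametric scene-graph prediction (illustrative only).
    exist_logits/exist_gt: (num_pairs, num_predicates) scores and 0/1 targets.
    mask_angle/mask_dist: float masks, 1 where that parametric relation exists."""
    # Existence: multi-label BCE over all predicate classes per subject-object pair.
    loss_exist = F.binary_cross_entropy_with_logits(exist_logits, exist_gt)

    # Angles wrap around at 360 degrees, so penalize the shorter arc.
    diff = torch.remainder(angle_pred - angle_gt, 360.0)
    angle_err = torch.minimum(diff, 360.0 - diff)
    loss_angle = (angle_err * mask_angle).sum() / mask_angle.sum().clamp(min=1.0)

    # Distances: plain L1 on the predicted parameter, masked to existing relations.
    loss_dist = (torch.abs(dist_pred - dist_gt) * mask_dist).sum() / mask_dist.sum().clamp(min=1.0)

    return loss_exist + loss_angle + loss_dist
```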
The result is a more expressive kind of scene graph. Instead of a single binary label, the model learns to produce a vector of flags and a corresponding vector of parameters, representing a spectrum of relations that can be present in a scene. This enables richer downstream tasks such as planning, navigation, and manipulation for embodied agents. It also invites more nuanced evaluation. COPA-SG introduces non-trivial metrics that respect the multi-label nature of the data: mean average precision for relation existence, and mean absolute error for parameter values. In short, the evaluation framework itself recognizes that “how” a relation exists can be as important as “whether” it exists.
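In the same spirit, a simplified version of that evaluation, using scikit-learn's average precision and ignoring the matching and thresholding details a full benchmark would need, could look like this.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def evaluate(pred_scores, gt_exists, pred_params, gt_params):
    """Illustrative metrics: mean average precision over predicate classes
    for relation existence, and mean absolute error on the parameters of
    relations that truly exist. All arrays are (num_pairs, num_predicates)."""
    ap_per_class = [
        average_precision_score(gt_exists[:, c], pred_scores[:, c])
        for c in range(gt_exists.shape[1])
        if gt_exists[:, c].any()            # skip classes absent from this split
    ]
    mean_ap = float(np.mean(ap_per_class))

    mask = gt_exists.astype(bool)
    mae = float(np.abs(pred_params[mask] - gt_params[mask]).mean())
    return mean_ap, mae
```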
Another notable aspect is multi-view evaluation. COPA-SG scenes aren’t flat snapshots; they come with multiple views of the same space. The team demonstrates that aggregating across views can substantially improve the detection of relations, especially the camera-dependent ones like right or in front of. The gains grow with more views, saturating around 15 views per scene, which is a practical hint for future embodied systems: moving around a room to collect different vantage points can yield a much richer, more reliable map of relationships than a single photo ever could. This multi-view angle aligns with how humans understand space—by peering from different corners, glimpsing from above, and circling objects until a coherent, triple-checked map emerges.
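A minimal sketch of such view aggregation follows, assuming the predictions from every view are already indexed by the same subject-object pairs and that simple score averaging is the fusion rule; the paper's actual aggregation may differ, and camera-dependent predicates would first need to be expressed in a common frame.

```python
import numpy as np

def aggregate_views(per_view_scores):
    """Fuse relation-existence scores predicted from several views of one scene.
    per_view_scores: list of (num_pairs, num_predicates) arrays, one per view,
    all indexed by the same subject-object pairs."""
    stacked = np.stack(per_view_scores, axis=0)   # (num_views, num_pairs, num_predicates)
    return stacked.mean(axis=0)                   # consensus score per relation
```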
Beyond the core dataset and models, COPA-SG provides a practical reasoning toolkit. The authors outline a framework that converts a predicted COPA-SG graph into a Neo4j database and then uses a small language model to generate Cypher queries from user prompts. The result is a kind of bridge between symbolic graphs and natural-language intents: you could ask a system, in plain words, to answer questions like “Where should I put this lamp so it sits next to the shelf without blocking the door?” or “How many wineglasses are on the kitchen counter?” The proto-relations layer even supports questions about hypothetical placements, letting you experiment with space in a reversible, queryable way. It’s not magic, but it is a living testbed for spatial reasoning that could feel surprisingly familiar to any human who has rearranged a room to accommodate new furniture.
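To give a flavour of that pipeline, the sketch below runs a hand-written Cypher query against a Neo4j database through the official Python driver. The graph schema here (Object nodes with a class property, ON relationships) is a hypothetical mapping invented for illustration, and in the authors' toolkit the query itself would be produced by a small language model from the user's question rather than written by hand.

```python
from neo4j import GraphDatabase  # official Neo4j Python driver

# Hypothetical schema: (:Object {cls}) nodes connected by typed relationships
# such as [:ON] or [:NEXT_TO {distance_cm}]. "How many wineglasses are on the
# kitchen counter?" then becomes a simple pattern match and count.
QUERY = """
MATCH (g:Object {cls: 'wineglass'})-[:ON]->(c:Object {cls: 'kitchen counter'})
RETURN count(g) AS num_wineglasses
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    record = session.run(QUERY).single()
    print(record["num_wineglasses"])
driver.close()
```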
What this means for AI in real rooms
The COPA-SG project matters because it reframes how we teach machines to understand space. If a robot can reliably answer not just what is where, but how far away it is, what angle separates two objects, and what would happen if a new object popped into the scene, then planning, navigation, and interaction become more robust and intuitive. Consider a home-assistant robot tasked with tidying up. It needs to reason about where to place a vase so it’s “next to” a plant at a safe distance from the table edge, while also considering the person who’s walking through the room and the angle at which the vase would be visible to the human eye. Parametric relations give it the precision to satisfy those constraints; proto-relations give it a mental map of what could be placed where to satisfy multiple constraints at once. The end result is a system that can reason like a person who can picture a room from multiple angles and imagine future layouts before lifting a finger.
The paper also foregrounds the value of synthetic data. COPA-SG argues that high-quality ground truth—especially ground truth that’s exhaustive and parameterized—can train models more effectively than real-world data alone, particularly when collecting perfect annotations is impractical. Synthetic data isn’t a shortcut; it’s a controlled laboratory where every relation is defined, every distance measured, and every potential future arrangement explored. When such data is paired with modern learning architectures and clever evaluation, the models trained on it become better partners for humans: more predictable, more interpretable, and more capable of reasoning across multiple viewpoints and hypothetical scenarios.
What COPA-SG offers beyond a dataset is a language for space that machines can actually speak. Parametric relations turn vague associations into numeric realities. Proto-relations create a sandbox for planning and design. And the synthesis—thousands of scenes, millions of relations, billions of microdecisions about angles and distances—gives downstream systems the kind of experiential grounding that often separates robust AI from clever heuristics. When you combine this with a practical reasoning toolkit—transforming graphs into queries, or letting small language models manipulate proto-relations to craft constructive geometry—the line between “seeing” and “doing” starts to blur in the most useful way.
All of this comes from a coordinated effort at the University of Augsburg, where the COPA-SG project demonstrates that better graphs don’t just describe scenes more crisply; they empower agents to think about space with a more human-like sense of proportion. The lead authors—Julian Lorenz and colleagues—show that you don’t need to wait for perfect perception to begin planning in the real world. You can begin by building a richer map of space, one that labels not just where things are, but how they relate, how far apart they sit, and how the scene would change if you added a new object. It’s a bold reminder that the quality of your questions—the precision you demand from your scene graph—often dictates the quality of your answers when a machine has to move through a real room with real people in it.
As researchers push this approach further, COPA-SG could become a foundational resource for embodied AI, robotics, interior design, and even augmented reality, where a precise, actionable map of space matters as much as an eye-catching image. The dataset’s exhaustive grounding and its dual innovations—parametric and proto-relations—offer a language for spatial reasoning that’s closer to human intuition than previous ground-truth graphs. In a future where AI agents will navigate our homes, plan our layouts, or assist with tasks that hinge on spatial nuance, COPA-SG provides a sturdy, scalable vocabulary to describe not just what is, but what could be in the rooms we share with machines.