In the world of computer vision, turning a handful of photos into a believable map of the world is like assembling a city skyline from silhouettes. The problem is not just about finding where all the buildings stand, but about knowing what the buildings’ sizes and angles really are. If you don’t know exactly how a camera sees the world, your reconstruction can resemble a funhouse mirror more than a faithful blueprint. This is the stubborn crux of structure-from-motion, the field that reconstructs 3D scenes from 2D pictures.
The new work from Lund University, led by Carl Olsson and Amanda Nilsson, pushes this frontier forward by asking: can we use what we already know about a camera to pull the reconstruction toward metric realism without getting stuck in the usual initialization traps? In other words, can we build near-metric maps from scratch, without painstaking, step-by-step bootstrapping that often drifts over time?
From projective chaos to metric clarity
When cameras are uncalibrated, the math behind turning images into 3D points is effectively underdetermined. Think of trying to measure a sculpture through frosted glass: you can see rough shapes, but precise sizes, angles, and distances stay ambiguous. In this regime, many different 3D reconstructions can explain the same photos, and they are related by a family of transformations that warp space while keeping the image projections intact. This is what researchers mean by projective ambiguity.
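To make the ambiguity concrete, here is the textbook statement from multiple-view geometry (a standard fact, not something specific to this paper): if camera matrices P_i and homogeneous 3D points X_j reproduce the observed image points x_ij, then so does any warped pair obtained from an invertible 4x4 matrix H:

```latex
x_{ij} \simeq P_i X_j = \bigl(P_i H^{-1}\bigr)\bigl(H X_j\bigr)
\quad \text{for any invertible } 4 \times 4 \text{ matrix } H.
```

Only when H is restricted to a similarity transform (rotation, translation, and uniform scale) do angles and length ratios survive the warp; "near-metric" reconstruction means shrinking the ambiguity from the full projective family of H down to that similarity family.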
Traditional approaches typically follow one of two tracks: build up a reconstruction incrementally (adding cameras and points one at a time) or solve a global problem with as much data as possible at once. Both routes wrestle with non-convex landscapes, full of local minima that trap the solver. A popular workaround has been the so-called pOSE framework, which replaces the usual reprojection error with a surrogate objective that is invariant to projective transforms. The upside is robustness to bad initial guesses; the downside is that, on its own, it tends to produce reconstructions that are only determined up to a projective transform, not the true metric scale and shape.
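To see why such a surrogate is friendlier to optimize, here is a minimal NumPy sketch contrasting the standard reprojection residual, which divides by depth, with a pOSE-style residual that avoids the division and is therefore bilinear in the camera matrix and the point. The weighting parameter eta and the exact form are simplified here; the original pOSE paper defines the precise mix of object-space and affine terms.

```python
import numpy as np

def reprojection_residual(P, X, m):
    """Standard residual: nonlinear in (P, X) because of the division by depth."""
    X_h = np.append(X, 1.0)        # homogeneous 3D point
    proj = P @ X_h                 # (u, v, w)
    return proj[:2] / proj[2] - m  # perspective division -> rugged landscape

def pose_style_residual(P, X, m, eta=0.05):
    """pOSE-style surrogate (simplified): no division, so the residual is
    bilinear in P and X. The weighting here is illustrative, not the paper's."""
    X_h = np.append(X, 1.0)
    proj = P @ X_h
    ose_term = proj[:2] - m * proj[2]   # object-space-style term
    affine_term = proj[:2] - m          # affine-camera term
    return np.concatenate([np.sqrt(1.0 - eta) * ose_term,
                           np.sqrt(eta) * affine_term])
```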
Olsson and Nilsson, working at Lund University, show that you can tilt the odds in favor of a metric-like result by bringing in one well-known quantity: the camera calibration. By weaving relative rotation information between camera pairs into the optimization, they nudge the solution away from purely projective distortions. In essence, they add a compass that points toward the real geometry, not just a believable projection. It’s as if you’re not only reconstructing a city from silhouettes, but you’re also anchoring some of the buildings to true sizes and angles using what you know about the cameras that captured the images.
Rotation as a calibration key
The key idea is elegantly simple in spirit: once the intrinsics are known, the relative rotations between camera pairs can be estimated, and those rotations are preserved only by similarity transformations (rotation, translation, and uniform scale) of the reconstruction. By estimating pairwise rotations between viewpoints and penalizing deviations from those estimates in a carefully designed way, the method injects calibration knowledge into the pOSE framework without imposing hard calibration constraints on every camera. This creates a bridge between two camps: the robustness of initialization-free surrogate errors and the realism of calibrated geometry.
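A compact way to see why relative rotations rule out the projective part of the ambiguity (a standard argument sketched here, not a derivation taken from the paper): write the calibrated cameras with the intrinsics factored out as P_i = [R_i | t_i], and apply a similarity transform whose inverse has rotation Q, uniform scale s, and translation c. Every rotation block picks up the same Q on the right, so all pairwise relative rotations are untouched:

```latex
P_i H_{\mathrm{sim}}^{-1}
= [\,R_i \mid t_i\,]
\begin{pmatrix} s\,Q & c \\ 0 & 1 \end{pmatrix}
= [\, s\,R_i Q \mid R_i c + t_i \,],
\qquad
(R_j Q)(R_i Q)^{\top} = R_j Q Q^{\top} R_i^{\top} = R_j R_i^{\top}.
```

A general projective H, by contrast, mixes its last row into the rotation blocks and changes the pairwise rotations, which is exactly why penalizing deviations from the measured relative rotations squeezes the solution set from the projective family down to the similarity family.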
Practically, the authors extend the optimization graph that underpins pOSE by adding edges between camera nodes, with penalties that grow as a candidate solution strays from the observed relative rotations. Rather than chasing exact orthogonality constraints on the rotation blocks of the camera matrices (which can trap the solver in local minima), they craft a smoother path that respects the geometry of rotations while still letting the solver roam freely where the data is noisy. They even introduce a lifting trick that makes the rotation penalties compatible with second-order optimization techniques, a move that helps the algorithm glide toward good solutions rather than grind to a halt in a quagmire of nonlinearities.
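As a purely hypothetical illustration of what such a soft edge penalty could look like (this is not the paper's formulation, and it omits the lifting trick entirely), one can project each camera's leading 3x3 block onto the nearest rotation and penalize its disagreement with a precomputed two-view estimate:

```python
import numpy as np

def closest_rotation(M):
    """Project a 3x3 matrix onto the rotation group via SVD (orthogonal Procrustes)."""
    U, _, Vt = np.linalg.svd(M)
    R = U @ Vt
    if np.linalg.det(R) < 0:   # enforce a proper rotation, det(R) = +1
        R = U @ np.diag([1.0, 1.0, -1.0]) @ Vt
    return R

def pairwise_rotation_penalty(P_i, P_j, R_ij_est, weight=1.0):
    """Soft penalty on the gap between the relative rotation implied by two candidate
    camera matrices and a two-view estimate R_ij_est. Hypothetical sketch only."""
    R_i = closest_rotation(P_i[:, :3])
    R_j = closest_rotation(P_j[:, :3])
    return weight * np.linalg.norm(R_j @ R_i.T - R_ij_est, "fro") ** 2
```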
In the language of the paper, this is a fusion of pOSE with rotation averaging. The relative-rotation terms act like a subtle, data-informed scaffold that keeps the solution from wandering into projective “fake geometry.” It’s not that the camera intrinsics are forced into the model; instead, the optimization is nudged to preserve metric features of the scene where the data supports them. The result is a problem that remains initialization-robust, yet yields reconstructions with faithful angles and proportions rather than abstract projective shapes.
A path to reliable 3D reconstructions
How do you actually solve this hybrid objective? The authors lean on a technique called the variable projection method (VarPro), which exploits the fact that the pseudo-object-space error is bilinear in the unknown camera matrices and the 3D points. This lets them solve for the points in closed form given the cameras, then iterate to refine the cameras. It’s a dance in which the 3D points and the camera parameters take turns, each updated while the other holds still, a choreography that tends to converge quickly to good solutions when guided by the rotation penalties.
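To make “closed form given the cameras” concrete, here is a minimal sketch that reuses the simplified pOSE-style residual from the earlier snippet: when the cameras are held fixed, every 3D point enters its residuals linearly, so each point is recovered by an ordinary linear least-squares solve, and the outer VarPro loop only has to update the cameras. This illustrates the structure, not the paper’s actual solver.

```python
import numpy as np

def solve_point_given_cameras(cams, obs, eta=0.05):
    """Closed-form (linear least-squares) solve for one 3D point, cameras held fixed.
    cams: list of 3x4 camera matrices observing the point
    obs:  list of 2-vectors of image measurements (one per camera)
    Uses the simplified pOSE-style residual; a sketch, not the paper's exact solver."""
    A_rows, b_rows = [], []
    for P, m in zip(cams, obs):
        A, a = P[:, :3], P[:, 3]                    # split P = [A | a]
        # OSE-style term: A[:2] X + a[:2] - m * (A[2] . X + a[2]) = 0
        A_rows.append(np.sqrt(1.0 - eta) * (A[:2] - np.outer(m, A[2])))
        b_rows.append(np.sqrt(1.0 - eta) * (m * a[2] - a[:2]))
        # affine term: A[:2] X + a[:2] - m = 0
        A_rows.append(np.sqrt(eta) * A[:2])
        b_rows.append(np.sqrt(eta) * (m - a[:2]))
    A_stack = np.vstack(A_rows)
    b_stack = np.concatenate(b_rows)
    X, *_ = np.linalg.lstsq(A_stack, b_stack, rcond=None)
    return X                                        # optimal 3D point for these cameras
```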
In their experiments, the Lund team compared several variants: a pure pOSE objective, pOSE augmented with a diagonal penalty that enforces near-orthogonality, a version that strictly parameterizes rotations (which risks getting stuck), and their own rotation-averaged, pairwise-rotation-penalized approach. Across a spectrum of datasets with hundreds of cameras and thousands of 3D points, the results were telling: the rotation-averaged method converged to the global minimum with high probability from random starting points, outperforming the more naive rotation-parameterized and purely orthogonality-penalized versions in both reliability and speed.
Crucially, the method doesn’t pretend to magically reveal perfect calibration. Instead, it nudges the optimization toward a state that is nearly metric. The researchers quantify this by looking at the fundamental matrices between camera pairs: their approach yields matrices that are much closer to essential (i.e., truly metric) than the baselines, especially as the dataset size grows. In other words, the reconstructions are visually plausible and geometrically faithful where the data supports them, without requiring a separate, brittle upgrade step that remaps projective results into a calibrated, real-world frame.
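For readers wondering how “close to essential” can be measured at all, here is one generic diagnostic (based on a standard property of essential matrices, not necessarily the exact score used in the paper): after folding the known intrinsics into a fundamental matrix, a true essential matrix has two equal non-zero singular values and a third equal to zero, so the deviation from that pattern quantifies how far a camera pair is from metric geometry.

```python
import numpy as np

def essentialness(F, K1, K2):
    """How close is the fundamental matrix F (for cameras with intrinsics K1, K2)
    to a valid essential matrix? Returns 0 for an exact essential matrix.
    Generic diagnostic sketch; the paper may report a different measure."""
    E = K2.T @ F @ K1                       # fold the calibration into F
    s = np.linalg.svd(E, compute_uv=False)  # singular values, descending
    s = s / s[0]                            # normalize out the arbitrary scale
    return abs(s[0] - s[1]) + s[2]          # deviation from the (1, 1, 0) pattern
```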
In the end, the study presents a practical route to near-metric structure-from-motion that preserves the robustness of initialization-free approaches. It’s a win for anyone hoping to deploy large-scale 3D reconstruction in the real world, whether that means mapping urban scenes, enabling augmented reality, or powering autonomous robots that need a reliable sense of space without endless fine-tuning.
Institution and authors: The work is from Lund University in Sweden, led by Carl Olsson and Amanda Nilsson. Their approach shows that incorporating relative rotations into a pOSE-based objective can push reconstructions from projective silhouettes toward near-metric reality, with convergence behavior that holds up reliably in practice.
Beyond the specifics of the algorithm, the broader implication is a shift in how we think about initialization in 3D reconstruction. Rather than tolerating a fragile, data-hungry startup sequence, this line of work suggests a design principle: use every reliable cue available—like known camera calibration—to gently steer the optimization toward the geometry that matters for real-world use. It’s a theme that resonates with other AI and vision systems, where robustness often comes from blending principled geometry with data-driven signals.
As with all scientific work, there are caveats. The method’s strength rests on having reliable relative rotation estimates between camera pairs, which require solving two-view problems with sufficient correspondences. In practice, noisy data or sparse views can still test the limits of the approach. Yet the results across multiple datasets, some with over a hundred cameras, suggest a promising path toward scalable, initialization-friendly, calibrated reconstructions that don’t rely on brittle upgrades or hand-tuned starts. If the trend holds, we could be looking at a future where getting a trustworthy 3D map from a casual photo stash becomes the norm rather than the exception.