When a model looks at a picture and answers a question, it’s absorbing a lot more than a caption. It’s parsing space, distance, and the delicate choreography of objects in a scene. That is the hard part: grounding language to the visual world. Researchers are deploying clever shortcuts to speed things up—pruning tokens, trimming the visual input, making the model skinnier. But in the realm of grounding, slimming down can hollow out meaning. A study from National Tsing Hua University in Taiwan, led by Tzu-Chun Chien with colleagues including Shiang-Feng Tsai and Ruei-Chi Lai, shows why token pruning often backfires on grounding tasks. The authors demonstrate that drastic performance drops aren’t inevitable, but they are predictable when spatial structure is disrupted. And they don’t just describe a problem; they offer a fix that’s elegant in its simplicity.
The core idea is surprisingly modest: preserve the map the model uses to understand where things sit in an image, even as you prune away the rest. The result is a method called Grounding-Aware Token Pruning, or GAP, which acts like a careful chiropractor for the model’s sense of space. GAP keeps the original position IDs intact while tokens are dropped, and it does so without extra training, memory, or compute. In other words, it’s a drop-in adjustment that slips into the existing pruning toolbox and brings grounding back from the brink. The authors tested GAP across multiple models—LLaVA variants, MiniGPTv2, Shikra, and more—and across several pruning strategies. The story is as much about what went wrong as what snapped back into place when GAP arrived.
The token pruning paradox for grounding
Pruning visual tokens is a straightforward way to make vision‑language models faster. A high‑resolution image can produce thousands of tokens; sending them through a transformer—the workhorse underlying these models—can be expensive. So researchers trim the token set and keep only the “most important” tokens according to some scoring rule. In many cases, the pruning works beautifully for questions that don’t hinge on precise spatial relations. But for Referring Expression Comprehension (REC)—the task of localizing an object described by a natural language expression—the math changes. Grounding is about the geometry of the scene as much as it is about the language describing it. And when the tokens are pruned, the model’s sense of where things sit starts to fray.
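For readers who like to see the mechanics, here is a minimal sketch of this kind of importance-based pruning. It is not the paper’s code; the scoring rule, tensor shapes, and keep ratio below are illustrative assumptions standing in for whatever a particular method actually uses.

```python
# A toy sketch of importance-based visual token pruning (illustrative, not the
# paper's implementation): score each visual token, keep the top-k scorers.
import torch

def prune_by_score(visual_tokens, scores, keep_ratio=0.5):
    """visual_tokens: (num_tokens, dim); scores: (num_tokens,) importance values."""
    num_keep = max(1, int(visual_tokens.shape[0] * keep_ratio))
    # Take the highest-scoring tokens, then sort them back into image order.
    keep_idx = torch.topk(scores, num_keep).indices.sort().values
    return visual_tokens[keep_idx], keep_idx

# Toy usage: 576 patch tokens (a 24x24 grid) with random importance scores.
tokens = torch.randn(576, 4096)
scores = torch.rand(576)
pruned_tokens, kept_indices = prune_by_score(tokens, scores, keep_ratio=0.5)
print(pruned_tokens.shape)  # torch.Size([288, 4096])
```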
In concrete terms, the study shows sharp drops in REC accuracy when pruning is applied. On RefCOCO, a standard REC benchmark, a popular LLaVA model dropped from 56.14% accuracy with all tokens to 15.34% after pruning via a CLS‑visual similarity method. That is not a small dip; it’s a collapse. The same pruning approach on MiniGPTv2 sent accuracy from a high 88.69% down to a meager 2.73% on the same task. These aren’t esoteric numbers; they map to real-world failures—misunderstanding which pixel belongs to which object, misplacing a bounding box, or simply guessing where to look. The effect was even more dramatic in grounding contexts than in more general visual question answering. The lesson is clear: pruning is not a neutral act when the model must reason about space and relations.
Two misalignments that break spatial grounding
The authors dug into why pruning wrecks grounding and settled on a surprisingly specific culprit: misalignment between visual tokens and their position IDs. In a typical setup, an image is broken into patches, each assigned a position ID that encodes its place in the image. Those tokens then flow into the language model, riding on their positional scaffolding. If you prune tokens, two things can happen. First, the remaining tokens can be reordered relative to their original sequence, so their associated position IDs no longer line up with where the model thinks they “should” be. Second, removing tokens can shift the remaining tokens to the front of the line, forcing a new, compressed set of position IDs on them. In short: the spatial map that tells the model where things sit becomes a scrambled or compressed jumble.
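A toy example makes the two failure modes concrete. The numbers below are invented purely for illustration; the point is what happens to the position IDs, not the values themselves.

```python
# Toy illustration (not from the paper): six patch tokens originally occupy
# positions 0-5, and pruning keeps the tokens at positions 0, 2, and 5.
kept = [0, 2, 5]

# Shift: naive repacking hands the survivors fresh, compressed position IDs,
# so the token that sat at position 5 now claims to sit at position 2.
compressed_ids = list(range(len(kept)))       # [0, 1, 2]

# Permutation: if the method also reorders tokens (say, by importance score),
# the mapping between token and coordinate scrambles further.
reordered = [5, 0, 2]                          # hypothetical score order
reordered_ids = list(range(len(reordered)))    # still [0, 1, 2]

# What grounding actually needs: each surviving token paired with the
# coordinate it came from.
original_ids = kept                            # [0, 2, 5]
print(compressed_ids, reordered_ids, original_ids)
```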
To test these ideas, the researchers designed experiments that teased apart the effects of permutation and shift. They demonstrated that even without removing tokens, simply permuting the order of tokens and recomputing their position IDs caused grounding to degrade. This showed that the spatial grounding problem isn’t just about fewer tokens; it’s about the misalignment between the tokens and the coordinates that anchor them in space. They also probed how well the vision encoder’s own position information survives deeper into the pipeline. The answer: spatial information in the ViT’s representations fades as features move toward the LLM. By the time the features reach the language component, the model has to rely more on its own constructed position IDs, which makes it particularly vulnerable to pruning’s side effects. This is the core intuition behind GAP: if you can keep the positional scaffolding intact, grounding can rebound even when tokens are trimmed.
GAP: Aligning position IDs without extra cost
GAP is elegantly simple. When pruning removes tokens, GAP preserves the original position IDs as if nothing had been pruned. It then adjusts how rotary embeddings—those sine‑cosine based positional encodings used in transformers—are applied so that the language model still receives a coherent sense of space. In practice, this means the model continues to interpret the remaining tokens as if their spatial relationships hadn’t been scrambled. Importantly, GAP does not require retraining, additional memory, or extra computation during inference. It’s a surgical adjustment that sits atop existing pruning methods, not a full re‑engineering of the model.
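Here is a minimal sketch of that idea, assuming a simplified rotary-embedding setup rather than any specific model’s implementation: the only change is to feed the surviving tokens’ original position IDs into the rotary embedding, instead of renumbering them after pruning.

```python
# Illustrative sketch of GAP-style position alignment with a simplified RoPE.
# Function names and shapes here are assumptions for the example.
import torch

def rope_angles(position_ids, dim, base=10000.0):
    """Per-position rotation angles for an even head dimension `dim`."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return torch.outer(position_ids.float(), inv_freq)  # (num_tokens, dim/2)

def apply_rope(x, position_ids):
    """Rotate channel pairs of x (num_tokens, dim) by position-dependent angles."""
    angles = rope_angles(position_ids, x.shape[-1])
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.empty_like(x)
    rotated[..., 0::2] = x1 * cos - x2 * sin
    rotated[..., 1::2] = x1 * sin + x2 * cos
    return rotated

# Pruning kept tokens that originally sat at patch positions 3, 17, and 42.
kept_positions = torch.tensor([3, 17, 42])
tokens = torch.randn(3, 64)

baseline = apply_rope(tokens, torch.arange(3))   # compressed IDs 0, 1, 2
gap_like = apply_rope(tokens, kept_positions)    # original IDs preserved
```

The cost of the change is essentially zero: the same rotary computation runs either way; only the integers passed in as position IDs differ, which is why this kind of alignment adds no training, memory, or latency overhead.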
That simplicity is part of the method’s power. By keeping the original positional structure, GAP preserves the relational cues that ground language to the image. The authors describe GAP as an add‑on to pruning, not a replacement for it. You can apply GAP to a wide range of pruning strategies—CLS‑visual pruning, text‑visual pruning, random pruning, or spatial pruning—and you will not pay a performance price in terms of speed or memory. It’s a compatibility layer for spatial reasoning inside a broader efficiency strategy.
How well does it work? Across five different models and six pruning methods, GAP consistently lifts grounding accuracy back toward unpruned levels. In the strongest example, LLaVA v1.5 at 7B parameters, pruning with CLS‑visual settings reduced RefCOCO val accuracy to 15.34%. Adding GAP brought it to 51.42%, recovering more than 90% of the original, unpruned performance. Similar recoveries appeared across other architectures, including LLaVA v1.6 at 13B, Llama‑based variants, MiniGPTv2, and Shikra. In short: GAP doesn’t just patch a single model; it appears to be a general solution to a global mismatch between tokens and their spatial anchors.
Beyond grounding: Across models, datasets, and tasks
The researchers didn’t stop at a single dataset. They tested GAP across RefCOCO and its variants, as well as other vision tasks where grounding isn’t the goal, such as general VQA benchmarks. The results show a delicate balance: GAP cures grounding drops without harming non-grounding abilities. In some VQA datasets—GQA, VizWiz, OK‑VQA—the fixes translate into small but meaningful gains, and in several cases, there is no measurable downside. That matters because it suggests GAP is not a fragile trick that only helps in a narrow corner case; it’s a robust improvement that respects the broader abilities of multimodal models.
To quantify the generality, the team performed cross‑method and cross‑ratio tests. They applied GAP to a spectrum of pruning strategies and to token reduction ratios from 0.2 to 0.8. The pattern held: grounding performance recovered across the board, sometimes with even larger gains as more aggressive pruning was used. They also tracked inference efficiency and found GAP adds no extra overhead. Time‑to‑first‑token and memory profiles were essentially unchanged compared with pruning without GAP. In other words, you don’t trade speed for accuracy with GAP—you gain reliability without paying anything for it.
Why this matters for a future of AI that sees and reasons
The GAP finding sits at a crossroads of practicality and vision. As multimodal models grow bigger and deployment shifts toward cheaper hardware, the temptation to prune aggressively will only grow. GAP offers a persuasive answer to a stubborn question: can we keep models fast without blunting their deepest capabilities—their ability to ground language in space? The answer the paper gives is yes, with a surprisingly small adjustment to a detail many practitioners might overlook: how we encode positions.
There are broader implications beyond the technical tweak. Grounding is a critical ingredient for reliable AI in the real world. If a robot or a digital assistant can’t accurately locate objects in a scene, it can misinterpret instructions, miss safety cues, or give wrong directions. The finding—that misalignment between tokens and position IDs is a central vulnerability in pruned models—signals a new design principle: spatial coherence should be safeguarded even as we strip away tokens for efficiency. GAP embodies that principle by proving you don’t have to choose between speed and spatial understanding.
What’s more, the study nudges us to rethink how we evaluate pruning strategies. The same token reduction that speeds things up can quietly erode the relational knowledge the model uses to reason about a scene. In practice, this means researchers and engineers may start checking grounding metrics earlier in the pruning pipeline, ensuring that efficiency improvements don’t come at the expense of core capabilities. It’s a reminder that the “simplify” impulse in AI must be balanced with a respect for the architecture of knowledge that actually makes these systems useful.
From a broader perspective, GAP is a small revelation about how memory and space intertwine in AI reasoning. It reframes a routine engineering choice—pruning tokens—as a question about which components a model treats as sacred: the spatial scaffolding that keeps track of where things sit in the scene. The upshot is not just better numbers; it’s a more trustworthy form of intelligence, one that can still reason about space even when we push the model to run faster and leaner. That’s the kind of progress that makes a difference when AI moves from demonstration to deployment in the real world.
In the end, the study from National Tsing Hua University, with lead author Tzu-Chun Chien and collaborators, invites a simple but powerful takeaway: the best way to prune a model is to prune with a map in hand. GAP doesn’t erase the cost of computation, but it preserves the map that lets models understand where things are, and that is where grounded reasoning lives. The result is a model that can be both faster and wiser about space—an update that feels less like a patch and more like a recalibration of how we teach machines to see the world.