Can a frozen transformer still learn new moves?

In the world of artificial intelligence, bigger has often meant better. Large language models keep swelling, soaking up more data, parameters, and compute until they feel almost like a force of nature. Yet there’s a growing ache behind the thrill: how do we keep costs from exploding alongside capability? A team of researchers led by Maciej Stefaniak at the University of Warsaw has a fresh answer. They call it Projected Compression, a way to shrink the backbone of these models without tossing away their old memories or forcing a costly retraining sprint. It’s a bit like teaching a pianist to play a smaller, lighter keyboard while still letting them access the full score in their head.

What makes this idea compelling is less a single flashy trick than a rethinking of what it means to compress a model. Instead of pruning away neurons and pretending the removed notes never existed, Projected Compression keeps the original model intact, frozen in place, and instead learns how to route information through a smaller, trainable subspace. The base weights stay put, but the model learns how to compose a compressed representation that still sings with the same training tune. It’s a generous approach: you don’t erase capability, you cleverly redirect it.

The study is a collaboration across several institutions in Poland, including the University of Warsaw, IDEAS NCBR, the Polish Academy of Sciences, Nomagic, and Wroclaw University of Science and Technology. The authors, led by Stefaniak and including Michał Krutul, Jan Małański, Maciej Pióro, and peers, frame Projected Compression as a principled alternative to the standard playbook of pruning and retraining. Their experiments with Transformer-based language models suggest that this method not only preserves quality but can outpace traditional hard pruning once you push into larger, higher-quality base models. In short: shrink the footprint without losing the brain cells.

A new way to shrink giants without unplugging their brains

Conventional model compression often relies on pruning: set aside the parts of the network deemed less important, then retrain to recover the lost performance. The shortcoming is obvious in the name: you prune away, and sometimes what you prune cannot be recovered. The PC team flips that script. They add trainable projection modules that sit on top of a frozen, full-sized base model. The idea is simple in formula but rich in consequence: WC = P1 W P2, where W is the original weight matrix, and P1 and P2 are learnable projection matrices that squeeze inputs and outputs into a smaller space. The resulting WC is the compressed weight matrix actually used during forward passes, while the original W remains intact and accessible during training. The model, therefore, maintains its full capacity in the learning process, even as the deployed representation becomes leaner.

Think of it like using a smart translator that can compress a novel’s meaning into a tiny, efficient cipher during use, while keeping the original text intact in a locked cabinet for reference. The translator learns which cipher shapes best capture the essence of the story without discarding the long vocabulary stored in the original book. In practice, PC can be configured with one- or two-sided projections, and there’s even a residual term Wr that helps the compressed matrix stay flexible as training progresses. All of this happens while the base weights stay frozen, kept safe from the churn of every training step. The key is that the computation per token during training remains roughly the same as training the full Transformer, while the final deployed model, and its memory footprint at inference, shrink in a principled, learnable way.
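To make the mechanics concrete, here is a minimal PyTorch sketch written from the description above rather than from the authors’ code. The class name ProjectedLinear, the initialization scales, and the exact placement of the residual term are illustrative assumptions; the essential points are that W stays frozen while P1, P2, and Wr are the only trainable pieces.

```python
import torch
import torch.nn as nn

class ProjectedLinear(nn.Module):
    """Minimal sketch of a two-sided projected-compression layer.

    The full base weight W (d_in x d_out) stays frozen; only the projections
    P1 (c_in x d_in), P2 (d_out x c_out) and an optional residual W_r
    (c_in x c_out) are trained. Names and shapes are illustrative, not the
    paper's exact parameterization.
    """

    def __init__(self, base_weight: torch.Tensor, c_in: int, c_out: int,
                 use_residual: bool = True):
        super().__init__()
        d_in, d_out = base_weight.shape
        # Frozen base matrix: stored as a buffer, so it receives no gradients.
        self.register_buffer("W", base_weight.detach().clone())
        # Trainable projections into and out of the smaller subspace.
        self.P1 = nn.Parameter(torch.randn(c_in, d_in) / d_in**0.5)
        self.P2 = nn.Parameter(torch.randn(d_out, c_out) / d_out**0.5)
        # Optional residual term that adds flexibility during training.
        self.W_r = nn.Parameter(torch.zeros(c_in, c_out)) if use_residual else None

    def compressed_weight(self) -> torch.Tensor:
        # W_C = P1 @ W @ P2 (+ W_r): the compressed matrix used in practice.
        W_C = self.P1 @ self.W @ self.P2
        return W_C + self.W_r if self.W_r is not None else W_C

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x lives in the compressed input space (..., c_in); output is (..., c_out).
        return x @ self.compressed_weight()
```

In a full Transformer, a module like this would stand in for each weight matrix slated for compression, and at deployment only the product WC needs to be kept.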

What is particularly striking is the framing of the compression process as a trainable projection problem that operates within and alongside the frozen core. The researchers emphasize that this isn’t merely a clever hack; it’s a reimagining of where the model’s “truth” lives during learning. The model continues to interact with the most important dimensions of the base matrix, only now through the pathways carved by P1 and P2 (and Wr). The forward pass effectively becomes xWC, which can be viewed as a three-step journey: first, the input is projected up into the base model’s space via P1, then the full W does its heavy lifting, and finally the result is projected back down through P2. The end result is a compact, trainable proxy that retains access to the full informational landscape while presenting a much smaller surface area for deployment.
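A toy example makes the two readings of xWC easy to check; the sizes below are arbitrary and the residual term is omitted for brevity.

```python
import torch

torch.manual_seed(0)
d, c = 64, 16                                         # toy base width and compressed width
W  = torch.randn(d, d)                                # frozen base matrix, never updated
P1 = (torch.randn(c, d) / d**0.5).requires_grad_()    # trainable input projection
P2 = (torch.randn(d, c) / d**0.5).requires_grad_()    # trainable output projection
x  = torch.randn(8, c)                                # a batch already in the compressed space

merged = x @ (P1 @ W @ P2)          # deploy-time view: x @ W_C
staged = ((x @ P1) @ W) @ P2        # training-time view: project up, apply W, project down
assert torch.allclose(merged, staged, atol=1e-4)

staged.sum().backward()             # a dummy loss, just to trace gradient flow
print(W.grad)                       # None: the frozen base receives no gradient
print(P1.grad.shape, P2.grad.shape) # torch.Size([16, 64]) torch.Size([64, 16])
```

Whichever way the product is associated, the result is identical up to floating-point noise, and the backward pass deposits gradients only on the projections.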

How the math meets the medicine cabinet of AI practicality

On the surface, Projected Compression resembles a classic trick from dimensionality reduction: shrink the space the data flows through, then learn the best way to route that flow. But there are two sharp edges here. First, the base matrix W stays frozen. That’s unusual in compression schemes, which typically retrain the entire network after pruning. Second, and crucially, the projection modules are trained with standard gradient descent, which means you don’t need a separate, heavy optimization routine just to make the compressed version sing. In aggregate, you get a model that is cheaper to run at inference time and inexpensive to compress, with a per-token training cost that matches the base Transformer’s footprint.
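That training loop can be sketched as ordinary gradient descent over only the trainable pieces, assuming a language-modeling objective and a model built from projected layers like the sketch above; the function name, learning rate, and loader protocol are placeholders, not the authors’ setup.

```python
import torch

def compress(model: torch.nn.Module, data_loader, steps: int, lr: float = 1e-4):
    """Run the compression phase as plain gradient descent on the projections."""
    # Only the trainable pieces (P1, P2, residuals) require gradients; the
    # frozen base matrices are buffers or frozen parameters and are skipped.
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=lr)
    model.train()
    for _, (inputs, targets) in zip(range(steps), data_loader):
        logits = model(inputs)
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), targets.view(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()  # moves the projections; the base weights never change
    return model
```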

The authors further argue that the approach scales up gracefully with the size and quality of the base model. In their experiments with GPT-2-style bases at 300M and 800M parameters, Projected Compression consistently outperformed the hard-pruning baseline at 50% compression and beyond, with the gap widening as the token budget grew. In other words, the bigger the brain you start with, the more you benefit from learning how to route its thoughts through a smaller cognitive corridor. The take-home message is nuanced but hopeful: compression isn’t just a matter of tearing away bits; it can be about learning smarter ways to access the bits you already own.

One practical detail the paper highlights is where the efficiency comes from. If you train with large batch sizes, a common setup in modern AI work, the extra cost of carrying the projection matrices through the backward pass is negligible. The per-token compute stays roughly in line with training the base Transformer, but with a crucial difference from hard pruning and retraining: you retain the full expressive potential of the base weights during learning. The “memory tax” you pay is real, because you must store both the full frozen base and the trainable projections, but the authors point to techniques like gradient checkpointing to mitigate it. It’s a trade-off: you temporarily carry a larger memory footprint to unlock a more compact, high-quality final model that can be more easily deployed in constrained environments.
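Gradient checkpointing is a standard PyTorch facility rather than anything specific to this paper; a generic sketch of how it could slot in, assuming the model is organized as a stack of Transformer blocks:

```python
from torch.utils.checkpoint import checkpoint

def forward_with_checkpointing(blocks, x):
    # Recompute each block's activations during the backward pass instead of
    # storing them, trading a little extra compute for a smaller peak memory
    # footprint while the frozen base and the projections coexist in memory.
    for block in blocks:
        x = checkpoint(block, x, use_reentrant=False)
    return x
```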

Why this matters beyond the lab

Compressing language models is not just an academic exercise. It’s a practical bottleneck that shapes who can build, test, and deploy AI systems. The current generation of LLMs is expensive to train and even more expensive to run at scale. If researchers and companies can shave memory and compute without sacrificing performance, the barrier to experimentation drops. That could accelerate the iterative process that yields safer, more capable systems, because more minds can tinker with more models in a reasonable time frame.

Projected Compression doesn’t pretend it’s a silver bullet. It’s a careful, well-mannered invitation to rethink where information lives inside a neural network. It suggests that the intelligence embedded in the base matrix doesn’t have to be erased to be practical; it can be accessed through a different, leaner interface that learns to mimic the larger brain’s behavior. The method also aligns with a broader movement in AI toward parameter-efficient fine-tuning and modular design—where you freeze the core, add a few well-chosen adapters, and let the learning happen in the margins rather than in the central organ itself. That philosophy could be transformative for research labs with limited hardware, startups exploring new ideas, and educational settings where students learn by editing real-world models without blowing through budgets.

There’s a cultural edge to this as well. AI research has often felt like a race to build bigger engines, with efficiency improvements treated as afterthoughts. Projected Compression reframes the conversation. It’s not about hoarding power; it’s about redistributing it more thoughtfully—keeping the model’s memory accessible during the learning phase while presenting a smaller, more practical face to the outside world. If the trend continues, compression could become not a status signal of a model’s scale but a standard design choice embedded in how we think about deploying AI in the real world.

Beyond the immediate results, the study nudges us toward a more inclusive future for AI experimentation. Universities and research centers with modest compute budgets might find it easier to test ideas that used to be off-limits because retraining a heavily pruned model would squander precious cycles. And for practitioners, a key question remains: how far can this approach scale across architectures beyond the Transformer, or across modalities beyond text? The paper hints at that possibility, inviting follow-up work that could generalize the projection philosophy to other neural building blocks and perhaps even to domains like vision or multimodal systems. The door is ajar, and the room beyond is full of possible ways to think about compression not as a subtraction but as a reconfiguration of how a model’s brain is wired.

Ultimately, the people behind Projected Compression remind us of the quiet art of making something smaller without erasing what makes it powerful. The lead author, Maciej Stefaniak, along with his colleagues at the University of Warsaw and collaborative institutions, present a technique that respects the ocean of data contained in a large model while offering a practical raft for engineers navigating the choppy seas of deployment. Their message is not just technical—it’s aspirational: we can keep the best of what we’ve built, even as we learn to fit it into a smaller, faster, more accessible form.

Key takeaway: Projected Compression changes the compression game by keeping the full base model’s memory intact during learning, while using learned projection matrices to deliver a compact, efficient final model. It’s a principled dance between preserving capacity and trimming the fat, with promising implications for who can build, test, and deploy sophisticated AI in the real world.

Lead researchers and affiliations: The work is from the University of Warsaw and collaborators at IDEAS NCBR, the Polish Academy of Sciences, Nomagic, and Wroclaw University of Science and Technology, led by Maciej Stefaniak with co-authors including Michał Krutul, Jan Małański, Maciej Pióro, Jakub Krajewski, Sebastian Jaszczur, Marek Cygan, Kamil Adamczewski, and Jan Ludziejewski.