The race to build smarter, faster language models has largely become a race to squeeze more performance out of the same handful of hardware platforms. The result isn’t just a speed bump for users; it’s a kind of hardware exclusivity: code that roars on one vendor’s GPUs can limp on another’s, forcing researchers and engineers to tailor, port, and re-optimize for every new chip. A team from IBM Research Europe argues that the road to real portability—software that runs with equal punch on multiple GPUs without rewriting kernels from scratch—may hinge on a simple idea dressed up in modern machinery: autotuning combined with a just-in-time compiler.
Led by Burkhard Ringlein, Thomas Parnell, and Radu Stoica, the work tests whether you can keep a kernel concise and broadly useful while still squeezing out state-of-the-art performance across hardware, without hand-tuning or a mountain of vendor-specific code. The team focuses on the core accelerators of today's AI stacks, the attention kernels that dominate the compute budget of large language models (LLMs), and shows that letting a system automatically explore a wide space of options can yield impressive results. What they demonstrate is not just feasibility but a potential shift in how we design the low-level building blocks of modern AI software.
Portability’s bind: why code stays stubbornly tied to a GPU
The standard approach to getting fast attention on GPUs has been to lean on template libraries that are highly specialized for particular hardware. Think of flash_attn on NVIDIA GPUs or rocm_flash_attn on AMD cards. These libraries deliver blazing performance, but their power comes at a price: tens of thousands of lines of hand-optimized code that must be ported and maintained as new hardware arrives or new problem sizes appear. In the paper's vocabulary, this is the hardware lottery: a kernel that shines on one platform may be a poor citizen on another, and porting it often requires substantial manual effort. The numbers are striking. The study notes that the native PyTorch implementation needs only about 29 lines of code for a kernel but runs 6–13 times slower than purpose-built libraries. The heavyweights, the custom template libraries, can be 23 to 2,500 times more verbose in lines of code, and porting them across architectures has demanded tens of thousands of additional changes.
This isn’t just about raw speed. Code size and complexity become a reliability problem as well: more lines of code mean more room for human error, and more places to introduce subtle bugs when algorithms evolve. The authors argue that this makes adopting new AI methods on new hardware an ever-steeper barrier, slowing the cycle of experimentation that drives progress. To illustrate the difficulty, they note that porting a single attention kernel to a new architecture in this ecosystem has required manual rework of substantial portions of the kernel, well beyond a quick tweak.
Beyond templates, a more flexible approach has emerged in the form of Triton, a domain-specific language for writing GPU kernels. Triton kernels are compact and portable, and they can be tuned with a set of hyperparameters—the so-called kernel configurations that govern scheduling, memory usage, and thread organization. The question the IBM team asks is simple but consequential: can we keep the lean, portable code footprint of something like Triton while achieving true performance parity with the best vendor-optimized code—but across GPUs from different vendors?
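To make those knobs concrete, here is a minimal sketch of a Triton kernel in Python. It is illustrative only, not a kernel from the paper, and names such as add_kernel are invented for the example. The compile-time BLOCK_SIZE and the launch-time num_warps are exactly the kind of configuration hyperparameters described above: they reshape tiling, scheduling, and memory access without touching the kernel's logic.

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        # Each program instance processes one BLOCK_SIZE-wide tile.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    x = torch.randn(1 << 20, device="cuda")
    y = torch.randn_like(x)
    out = torch.empty_like(x)
    grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK_SIZE"]),)
    # BLOCK_SIZE and num_warps are tunable "kernel configuration" knobs:
    # different values change tiling and thread organization, not the math.
    add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024, num_warps=4)

The same source can run on NVIDIA and AMD GPUs through Triton's different backends; in practice, only the best values of these knobs tend to differ between devices.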
Autotuning as a bridge between hardware and software
Autotuning is the star here. The core idea is to pair a JIT (just-in-time) compiler with a broad, empirical search over a kernel’s configuration space. Instead of building one “best guess” implementation, autotuning fabricates many variants, tests them on the actual hardware and workload, and selects the winner. This approach can reveal performance opportunities that hand-tuned templates miss, especially as hardware evolves and problem sizes vary. The team uses Triton—an open-source DSL that makes it easier to write GPU kernels in Python—and augments it with a comprehensive autotuner that explores hundreds or thousands of configurations per kernel shape.
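A hedged sketch of what that empirical search can look like with Triton's public @triton.autotune decorator follows; the kernel and the configuration grid are illustrative stand-ins, not the flash-attention search space used in the paper.

    import itertools
    import torch
    import triton
    import triton.language as tl

    # An illustrative configuration grid: every combination is compiled as a
    # separate kernel variant, benchmarked on the device at hand, and the
    # fastest one is cached per value of the `key` arguments.
    configs = [
        triton.Config({"BLOCK_SIZE": bs}, num_warps=w, num_stages=s)
        for bs, w, s in itertools.product(
            (256, 512, 1024, 2048), (2, 4, 8), (2, 3, 4)
        )
    ]

    @triton.autotune(configs=configs, key=["n_elements"])
    @triton.jit
    def scale_kernel(x_ptr, out_ptr, alpha, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, alpha * x, mask=mask)

    x = torch.randn(1 << 22, device="cuda")
    out = torch.empty_like(x)
    grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK_SIZE"]),)
    # The first call for a given n_elements triggers the search; later calls
    # reuse the winning configuration. BLOCK_SIZE is supplied by the chosen
    # config, so it is not passed at the call site.
    scale_kernel[grid](x, out, 2.0, x.numel())

The point of the decorator is exactly the idea described above: the developer declares a space of candidates, and the runtime, not the programmer, decides which variant the current hardware and problem size deserve.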
Their experiments center on flash attention, the kernel that carries the core computation of attention in LLMs, and on RMS layer normalization, two of the most performance-sensitive kernels in modern models. They run tests on two very different GPUs, NVIDIA's A100-80GB and AMD's MI250-128GB, to probe true portability across vendors. The autotuned Triton kernel is unchanged across platforms, yet its performance tracks closely with, or even surpasses, vendor-optimized implementations in many scenarios, all while carrying far less code and without bespoke changes for each platform. In the best cases, the autotuned kernel is about 2.3 times faster than the vendor-optimized baselines; in other circumstances, it still hits a substantial portion of state-of-the-art performance without manual optimization. Perhaps most striking, the autotuned solution needs roughly 70 times fewer lines of code than the largest template-based rivals.
Crucially, the authors show that autotuning unlocks a much larger exploration of the optimization space. They report that their autotuned kernels examine up to 15 times more kernel configurations than traditional methods, revealing a wider spectrum of code variants that can be optimized for memory coalescing, tiling, and other micro-architectural factors. The payoff isn’t just faster code; it’s a tangible form of architectural empathy: the kernel learns to speak the hardware’s language in a way that’s not pre-scripted for a single GPU family.
What the numbers say about cross-GPU portability
One obvious question is whether a single, autotuned kernel can really play nicely on different GPUs. The answer in this study is nuanced but optimistic: the autotuned Triton kernel is broadly competitive on both the NVIDIA and AMD platforms, across a range of sequence lengths and batch sizes, with far less code under the hood. This is the sense in which autotuning helps deliver “portable SOTA performance” without recoding for each new device. The improvement is not a mere incremental gain; in the best configurations, the autotuned kernel not only holds its own but can outperform specialized vendor code, and it does so with a fraction of the source code.
The study also digs into the cost and complexity of cross-platform tuning. For example, reusing a configuration optimized for one GPU on another can degrade performance dramatically, sometimes by several times, and sometimes the configuration is simply invalid on the new platform. The takeaway: portability isn’t a matter of shipping one configuration; it’s about letting a system explore a broad landscape of options for each target device, rather than assuming a single recipe will work everywhere. The authors quantify this with an analysis of code diversity: the 450 Triton configurations explored during autotuning produced a richer, more diverse set of PTX instructions than a bank of 30 CUDA templates. That diversity isn’t trivia; it’s evidence that autotuning can uncover unique, platform-specific optimizations that templates tend to miss.
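As a rough illustration of how such a cross-device comparison can be measured, the sketch below times one pinned configuration against a small re-search on whatever GPU is present, using Triton's do_bench helper; the kernel and the configuration values are hypothetical stand-ins, not the paper's benchmark.

    import torch
    import triton
    import triton.language as tl
    from triton.testing import do_bench

    @triton.jit
    def scale_kernel(x_ptr, out_ptr, alpha, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, alpha * x, mask=mask)

    def bench(x, out, block_size, num_warps):
        # Average runtime in milliseconds for one fixed configuration.
        grid = (triton.cdiv(x.numel(), block_size),)
        return do_bench(lambda: scale_kernel[grid](
            x, out, 2.0, x.numel(), BLOCK_SIZE=block_size, num_warps=num_warps))

    x = torch.randn(1 << 24, device="cuda")
    out = torch.empty_like(x)
    # A configuration that won on some other GPU, reused blindly here...
    ms_pinned = bench(x, out, block_size=2048, num_warps=8)
    # ...versus a small re-search on the GPU this process actually sees.
    candidates = [(bs, w) for bs in (256, 512, 1024, 2048) for w in (2, 4, 8)]
    ms_best = min(bench(x, out, bs, w) for bs, w in candidates)
    print(f"pinned: {ms_pinned:.3f} ms, re-tuned best: {ms_best:.3f} ms")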
In one revealing result, the team notes that even when CUDA code is cross-compiled to run on AMD hardware, the autotuned Triton approach still outperforms that cross-compiled baseline in many cases. The implication is clear: portability and performance aren’t mutually exclusive; with the right autotuning infrastructure, you can have both.
Why autotuning isn’t mainstream yet and what’s needed
The authors don’t sugarcoat the gap between theory and practice. Autotuning, as a concept, has a proven track record in academic circles, but real-world adoption faces friction. The paper inventories three big hurdles: the overhead of autotuning itself, the need for a robust API to define configuration spaces and their dependencies, and the challenge of making autotuning results reusable across sessions and deployments. The autotuning process is not free: it requires extra compilation and execution steps to test candidate kernels, and those results must remain valid in future runs if you want to avoid re-tuning from scratch.
Practical autotuning will require: a high-level API that lets developers declare parameter spaces and dependencies; smarter search strategies to prune the tens of thousands of possible configurations down to a handful of likely winners; caching and sharing tuned configurations so teams don’t re-tune for every new deployment; and, ideally, moving autotuning off the critical path—either performing it ahead of time or during idle GPU cycles. The authors lay out a concrete roadmap for these gaps, from reusable autotuning caches to pre-deployment tuning that can ride in the background rather than blocking real-time inference.
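One way to picture the reusable-results piece is a small cache keyed by device and problem shape, sketched below. The file layout, the helper names, and run_autotuner are all hypothetical and only meant to show where persisted tuning results would slot into a deployment; none of this is the paper's API.

    import json
    import os
    import torch

    # Hypothetical on-disk cache of winning configurations.
    CACHE_PATH = os.path.expanduser("~/.cache/kernel_autotune/configs.json")

    def _key(kernel_name, shape):
        # Tuning results are only valid for the device they were measured on.
        device = torch.cuda.get_device_name()
        return f"{device}|{kernel_name}|{'x'.join(map(str, shape))}"

    def load_best_config(kernel_name, shape):
        """Return a previously tuned configuration dict, or None to re-tune."""
        if not os.path.exists(CACHE_PATH):
            return None
        with open(CACHE_PATH) as f:
            return json.load(f).get(_key(kernel_name, shape))

    def store_best_config(kernel_name, shape, config):
        """Persist a tuning result so later sessions skip the search."""
        cache = {}
        if os.path.exists(CACHE_PATH):
            with open(CACHE_PATH) as f:
                cache = json.load(f)
        cache[_key(kernel_name, shape)] = config
        os.makedirs(os.path.dirname(CACHE_PATH), exist_ok=True)
        with open(CACHE_PATH, "w") as f:
            json.dump(cache, f, indent=2)

    # Hypothetical usage: tune off the critical path, reuse at serving time.
    # cfg = load_best_config("flash_attention", (batch, heads, seq_len, head_dim))
    # if cfg is None:
    #     cfg = run_autotuner(...)   # expensive search, done ahead of deployment
    #     store_best_config("flash_attention", (batch, heads, seq_len, head_dim), cfg)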
A future where portable AI is the default, not the aspiration
The core message of the study is deceptively simple: the bottleneck here isn’t the hardware; it’s the mismatch between software and hardware that autotuning can bridge. If you can harness a JIT compiler to generate multiple kernel variants and pair it with a disciplined autotuning process, you can achieve strong, cross-platform performance without bespoke rewrites for every GPU family. That doesn’t just make life easier for researchers—it could reshape the AI software ecosystem. Portable, high-performance kernels would lower the barrier to experimenting with new hardware, enabling faster cycles of innovation as chip makers compete on true capability rather than on the narrow range of software that runs best on their machines.
The study foregrounds a practical, scalable path toward performance portability. The authors, Burkhard Ringlein, Thomas Parnell, and Radu Stoica of IBM Research Europe, are explicit about the trade-offs and the work needed to popularize autotuning. They argue that with a concerted effort to standardize autotuning interfaces, improve search algorithms, and store reusable tuning results, the AI software stack can become more resilient to the vagaries of hardware upgrades and vendor shifts. In other words, portability could stop being a luxury feature of a few open-source projects and become a default expectation of every LLM deployment.
Autotuning could be the lever that finally tilts the balance toward true hardware portability, letting researchers run experiments on a broader set of GPUs without rewriting kernels for each one. If this vision holds, the next wave of AI tooling might look less like a patchwork of vendor-specific hacks and more like a unified, self-optimizing platform that rewards curiosity rather than engineering grit. The implications reach beyond speed: we’re talking about a more democratic, adaptable AI stack that invites newcomers to explore hardware choices without fear of locking themselves into a single ecosystem.
In the end, the IBM study doesn’t promise a miracle cure for all portability ills. It does offer a credible, data-backed pathway: embrace autotuning alongside modern JIT compilers, and you get closer to the dream of portable, high-performance AI kernels that work across GPUs from different vendors. It’s a reminder that sometimes the right idea—let the machine experiment for you—can turn the most stubborn problems into opportunities for broader, faster progress.
As the authors note, the road to practical autotuning is not just technical; it’s infrastructural. Building an ecosystem where autotuning results are reusable, sharing them across teams, and weaving autotuned kernels into standard tooling will require community effort, standards, and a willingness to move some work from the fast path to the background. If that vision comes to pass, the next time a new GPU hits the market, the software stack may just adapt on the fly, leaving researchers free to focus on what really matters: bigger models, bolder ideas, and better conversations between code and hardware.