ARMS Makes Memory Tiering Tune Itself Without Knobs

What memory could feel like if it stopped needing knobs

In the modern data center, memory is the new electricity bill. The cost of keeping data close to the processors matters as much as the speed of the processors themselves. To stretch memory budgets, researchers have been layering different kinds of memory into tiers: fast but expensive DRAM on one end, slower but cheaper technologies like non-volatile memory on the other. The idea is elegant: keep the hot, frequently used data in the fast tier and park the cold data farther away. The dream is a system that behaves like a smart, self-optimizing librarian, shifting shelves on the fly so the right books are always within arm’s reach.

But reality rarely hands you a perfect librarian. Past memory tiering systems relied on fixed knobs and thresholds to decide which pages of data belong in the fast shelf versus the slow shelf. Those knobs are not one size fits all. A workload that dances to a different rhythm, a batch of data with a peculiar access pattern, or a faster link between tiers can render a chosen threshold suboptimal. The result is wasted bandwidth, delayed promotions of hot data, or unnecessary migrations that churn memory like a busy airport baggage carousel.

This is where the University of Wisconsin–Madison team behind ARMS steps in. Led by Sujay Yadalam, with Konstantinos Kanellis, Michael Swift, and Shivaram Venkataraman, they dug into what actually makes tiered memory work well across diverse workloads and hardware. They found that the real win comes from letting the system learn and adapt instead of clinging to static rules. The product of that insight is ARMS: Adaptive and Robust Memory tiering System. It promises to deliver strong, stable performance without manual tuning, by design rather than by luck.

In this article we explore what ARMS is, why the old knob-driven approach struggles, and how ARMS reimagines the problem so memory just works, no matter the workload or hardware—an idea with implications for everything from AI training to real-time databases and beyond.

Why tuning memory tiering is a bigger problem than it looks

The core difficulty in memory tiering is deceptively simple: what counts as hot data can change from moment to moment. Some pages become hot because a graph algorithm suddenly converges on a new region of the data; others stay hot for minutes, then fade away. Past systems tried to solve this with fixed rules. They would sample memory references, count how often a page was touched, and compare that count against a threshold to decide whether the page is hot. If the number of references to a page exceeded the threshold, it would be promoted to the fast tier; if it fell below, it would be demoted. The problem is that the threshold is a blunt instrument. The same value can be right for one workload on one machine and disastrously wrong for another workload or another machine.

To illustrate the fragility, the authors studied several existing tiering engines, such as HeMem, Memtis, and TPP. They showed that out of the box, these systems often outperform naive no-tiering baselines, but their gains hinge on the knobs being carefully tuned to the workload. When the knobs are tweaked, gains can be substantial. But there is no universal best setting. A workload like GapBS-BC on one graph might love a certain hot-threshold and promotion cadence, while Silo-YCSB on another setup might demand an entirely different balance. And tuning is not a one-and-done task. It depends on input data sizes, thread counts, and the bandwidth relationship between fast and slow memory. In practice, tuning a fleet of machines to run a mixed workload is a job for a small army of memory nerds, and one that hardly scales across a data center.

The UW–Madison team crystallized three recurring patterns that emerged when tuning was beneficial. First, tuning helps identify hot pages earlier and more accurately. If you catch hot data early, you avoid a volley of slow-tier accesses that stall the whole application. Second, tuning reduces wasteful migrations—promoting a page when it’s about to fade away is a fast track to thrashing. Third, tuning accelerates migrations so hot pages actually reach the fast tier in time to make a difference, rather than arriving after the moment of peak usefulness. In short, tuning unlocked the potential of memory tiering, but it also highlighted how brittle threshold-based systems can be when the workload shifts even slightly.

From those insights arose a new blueprint. If threshold tuning is essential but costly and brittle, can we build a tiering engine that sidesteps thresholds altogether yet still adapts to changing workloads and hardware realities? The answer, according to ARMS, is yes. The idea is not to replace memory tiers with a magic wand, but to replace the rules that govern data placement with policies that learn and adapt in real time. That is the spirit of the following design and its bold claim: you can get near-tuned performance without any tuning at all.

ARMS at a glance: three pillars of threshold-free adaptation

ARMS rests on three core ideas, each built to remove the reliance on hand-tuned thresholds while still delivering high performance across a wide range of workloads and hardware configurations.

First, hot and cold pages are identified not by fixed cutoffs but by relative scoring. Each data page carries metadata about how often it has been touched, together with short- and long-term views of its recent history. Specifically, ARMS maintains two moving averages of page accesses: a short-term one that reacts quickly to change, and a long-term one that smooths over noise and stabilizes decisions. The hotness of a page is computed as a weighted combination of these two scores, and pages are ranked accordingly. The fast tier holds the top-k pages by hotness, where k is the fast tier’s capacity. Because the ranking is dynamic and based on the workload’s current behavior, ARMS avoids brittle thresholds altogether while still promoting the data most in need of speed.
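To make the relative-scoring idea concrete, here is a minimal sketch in Python. The decay factors, the blend weight, and the names (PageStats, hotness, top_k_hot) are illustrative assumptions rather than ARMS’s actual constants or code; only the structure (two moving averages, a weighted blend, a top-k cut) follows the description above.

```python
from dataclasses import dataclass
import heapq

# Illustrative decay factors and blend weight; the real system's
# constants are not specified here.
ALPHA_SHORT = 0.5   # short-term average reacts quickly to change
ALPHA_LONG  = 0.05  # long-term average smooths over noise
W_SHORT     = 0.3   # weight given to the short-term view

@dataclass
class PageStats:
    short_ewma: float = 0.0
    long_ewma: float = 0.0

    def record(self, accesses_this_interval: int) -> None:
        # Update both moving averages with the latest sample count.
        self.short_ewma = (ALPHA_SHORT * accesses_this_interval
                           + (1 - ALPHA_SHORT) * self.short_ewma)
        self.long_ewma = (ALPHA_LONG * accesses_this_interval
                          + (1 - ALPHA_LONG) * self.long_ewma)

    def hotness(self, w_short: float = W_SHORT) -> float:
        # Hotness is a weighted blend of the two views of history.
        return w_short * self.short_ewma + (1 - w_short) * self.long_ewma

def top_k_hot(pages: dict[int, PageStats], k: int) -> set[int]:
    # The fast tier holds the k hottest pages by relative ranking;
    # no fixed hotness cutoff is involved.
    return set(heapq.nlargest(k, pages, key=lambda p: pages[p].hotness()))
```

The key property is that nothing in this sketch is an absolute cutoff: whether a page lands in the fast tier depends only on how it ranks against its peers at that moment.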

Second, ARMS embraces workload change rather than fighting it. It uses a change-point detector to sense when the workload shifts in a meaningful way, particularly when hot data begins to migrate toward or surge within the slow tier. When such a change is detected, ARMS temporarily shifts into a mode that prioritizes short-term trends, enabling newly hot pages to rise quickly to the fast tier. Once the system stabilizes again, it returns to a more robust history-based scoring. This recency-aware behavior ensures ARMS responds promptly to real shifts in the application’s memory access patterns without overreacting to fleeting blips.
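One way to picture the change-point step, assuming (as the evaluation discussion later notes) that the detector watches slow-tier bandwidth: keep a running baseline and flag a shift when recent traffic departs from it by a wide margin. The window size and the deviation factor below are placeholders, not ARMS’s actual detector logic.

```python
from collections import deque
from statistics import mean, pstdev

class SlowTierChangeDetector:
    """Flags a workload shift when slow-tier bandwidth departs sharply
    from its recent baseline. Window size and deviation factor are
    illustrative placeholders, not ARMS's actual parameters."""

    def __init__(self, window: int = 30, k_sigma: float = 3.0):
        self.history = deque(maxlen=window)
        self.k_sigma = k_sigma

    def observe(self, slow_tier_bw: float) -> bool:
        shifted = False
        if len(self.history) == self.history.maxlen:
            mu, sigma = mean(self.history), pstdev(self.history)
            # A sustained surge in slow-tier traffic suggests the hot set
            # has moved; the caller can then tilt scoring toward recency.
            if sigma > 0 and abs(slow_tier_bw - mu) > self.k_sigma * sigma:
                shifted = True
                self.history.clear()  # start a fresh baseline after a shift
        self.history.append(slow_tier_bw)
        return shifted
```

When observe() returns True, the blend weight from the earlier sketch would be tilted toward the short-term average (for example, by raising w_short), then relaxed back toward the long-term view once the detector stays quiet for a while.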

Third, ARMS grounds migrations in a cost-benefit calculus. A promotion is performed only if the expected performance benefit—reduced latency from moving a page to the fast tier—exceeds the migration cost in bandwidth and time. The system also uses a multi-round promotion filter to avoid chasing one-off bursts and to ensure persistence in hot-page behavior before promoting. And when migrations do occur, ARMS uses batched migrations, adjusting the batch size to avoid interfering with the running application while still exploiting available bandwidth. It is a practical blend of speed, caution, and bandwidth awareness.

Taken together, these ideas form a threshold-free architecture that remains sensitive to the workload and the hardware it runs on. The result is a memory tiering engine that routinely adapts its behavior to the moment, rather than relying on a fixed rule that may or may not fit the moment. It is a design philosophy as much as a technical strategy: act locally on short-term signals when needed, but anchor decisions in robust long-term trends, and always weigh the cost against the benefit before you move memory around.

The anatomy of ARMS: how hotness, change, and migrations come together

Delving into ARMS, you can see how the system maps the messy tempo of real workloads into a coherent memory management approach. The per-page metadata is modest in footprint: about 20 bytes per page at 2 MB page granularity. Even a highly active 200 GB dataset would incur only a few megabytes of metadata, keeping the overhead tiny in the grand scheme of a data-center memory system. ARMS relies on hardware performance counters, notably PEBS-like samples, to gather per-page access data. The sampling cost is low, and the authors report a few percent CPU overhead at typical sampling rates. The payoff is a memory manager that learns from real access patterns rather than pretending to know them a priori.
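The footprint claim is easy to check with back-of-the-envelope arithmetic; the 20-byte and 2 MB figures come from the description above, and the rest is simple division:

```python
dataset_bytes   = 200 * 2**30   # a 200 GB working set
page_bytes      = 2 * 2**20     # 2 MB page granularity
metadata_per_pg = 20            # ~20 bytes of ARMS metadata per page

pages          = dataset_bytes // page_bytes      # 102,400 pages
metadata_bytes = pages * metadata_per_pg          # 2,048,000 bytes
print(pages, round(metadata_bytes / 2**20, 2))    # 102400 1.95 (about 2 MB)
```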

The hotness score at the heart of ARMS is a blend of two exponentially weighted moving averages (EWMAs), one quick and one slow. The short-term average captures rapid surges in activity, while the long-term average preserves stability and helps avoid overreacting to short-lived fluctuations. Importantly, ARMS adapts the relative weights of these two averages depending on the workload’s phase. In stable periods, the long-term history dominates, protecting against churn. When a shift is detected, the system nudges the scoring toward recency, letting newly hot pages ascend quickly.

Deciding when to promote is where ARMS’ cost-benefit logic shows its strength. Each promotion candidate is evaluated against a proposed cold page to demote, using a simple but powerful criterion: promote only if the benefit of moving to the fast tier, scaled by the page’s access counts and how long it has stayed hot, surpasses the cost of migration. The demotion side of the coin uses a similar logic, but with different priorities: it favors moving cold pages out of the fast tier to free space for genuinely hot data. The result is a balanced system that avoids wasting precious fast-tier capacity on data that won’t stay hot long enough to be worth the move.
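In code form, such a promote-versus-demote check might look like the sketch below. The benefit and cost model here (expected accesses times the latency gap between tiers, against a per-page migration charge) is an assumption chosen for illustration; the paper’s exact formula may differ.

```python
def should_promote(candidate_hotness: float,
                   victim_hotness: float,
                   expected_accesses: float,
                   latency_gap_ns: float,
                   migration_cost_ns: float) -> bool:
    """Promote a slow-tier page over a fast-tier victim only if it pays off.

    Illustrative model: the benefit is the latency saved across the
    accesses the page is expected to receive while it stays hot; the
    cost is the time and bandwidth spent moving the candidate in and
    the victim out (roughly two page migrations).
    """
    benefit = expected_accesses * latency_gap_ns
    cost = 2 * migration_cost_ns
    # Swap only if the candidate is genuinely hotter than the victim
    # and the saved latency outweighs the migration work.
    return candidate_hotness > victim_hotness and benefit > cost
```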

When it comes to moving pages, ARMS abandons serial promotion in favor of a priority-based, batched approach. The hottest pages are promoted first, but not the instant they first look hot; instead, ARMS uses a multi-round filter that requires sustained hotness before a page is queued for promotion. This guardrail prevents a flurry of premature migrations, a common source of inefficiency in threshold-based systems. The actual data movement is batched according to real-time bandwidth availability, ensuring the application’s throughput is not disrupted. The system can also scale the batch size up or down based on observed bandwidth, which helps maintain steady performance even when the workload ebbs and flows.
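A minimal sketch of that promotion path, assuming a fixed number of confirmation rounds and a batch size scaled by observed bandwidth headroom; the round count, the batch bounds, and the function name are illustrative, not taken from the paper.

```python
import heapq

CONFIRM_ROUNDS = 3              # rounds a page must stay hot before it can move (assumed)
MIN_BATCH, MAX_BATCH = 8, 512   # batch-size bounds, in pages (assumed)

def select_promotions(hot_candidates: dict[int, float],
                      hot_streak: dict[int, int],
                      free_bw_fraction: float) -> list[int]:
    """Pick this round's promotion batch.

    hot_candidates:   {page_id: hotness} for pages ranked hot but still
                      resident in the slow tier.
    hot_streak:       {page_id: consecutive rounds the page has ranked hot}.
    free_bw_fraction: 0.0-1.0 estimate of current bandwidth headroom.
    """
    # Multi-round filter: only pages that have stayed hot are eligible.
    eligible = {p: h for p, h in hot_candidates.items()
                if hot_streak.get(p, 0) >= CONFIRM_ROUNDS}

    # Bandwidth-aware batching: move more pages when the application
    # leaves bandwidth on the table, fewer when it does not.
    batch_size = int(MIN_BATCH + free_bw_fraction * (MAX_BATCH - MIN_BATCH))

    # Hottest pages go first within the batch.
    return heapq.nlargest(batch_size, eligible, key=eligible.get)
```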

From lab tests to real-world behavior: how well does ARMS actually work?

ARMS was evaluated against several well-known tiering engines, including the default and tuned configurations of HeMem, Memtis, and TPP. The tests spanned a mix of workloads that stress different parts of memory systems: graph analytics, in-memory databases, indexing, and micro-benchmarks with irregular access patterns. The hardware setup included machines that emulate modern memory hierarchies, including Optane-based slow tiers and CXL-like memory in NUMA configurations. Across these environments, ARMS consistently delivered strong performance without any tuning, sidestepping the brittleness that plagues many knob-based systems.

On a machine with a slow tier made of non-volatile memory, ARMS outperformed the existing state of the art by about 1.26x to 2.32x on average without any tuning. Even more striking, ARMS came within 3% of the best-tuned performance. In other words, ARMS often matched the gains you’d achieve after a careful, workload-specific parameter sweep, but without the perpetual tuning cycle. That is not just a margin of victory; it is a fundamental shift in how confident we can be about deploying memory tiering in heterogeneous environments.

Several factors behind ARMS’ gains deserve emphasis. First, the threshold-free hot/cold classification proved more robust than static thresholds across diverse workloads. By using a short-term EWMA to capture rapid shifts and a long-term EWMA for stability, ARMS avoids both under-promotion during surges and over-promotion during temporary spikes. Second, the change-point detector, which hinges on monitoring slow-tier bandwidth, reliably flagged when a workload’s hot set was changing. This let ARMS switch into a recency-focused mode just long enough to catch the new hot pages, then return to the long-horizon view once the pattern stabilized. Third, the cost-benefit approach kept migrations anchored in real gains, eliminating many wasteful promotions and demotions that would otherwise burn memory bandwidth and degrade throughput.

Beyond raw numbers, ARMS demonstrated resilience across hardware and configuration. When pushed onto NUMA-based emulation of CXL memory with a different thread count, ARMS still outperformed the baseline and matched tuned configurations closely. The authors also showed that ARMS remains effective even as the ratio of fast-to-slow memory shifts, a common scenario as data centers experiment with different tiers. The upshot: ARMS isn’t a one-trick pony for a single machine; it adjusts its behavior to the hardware you actually have in play, and to the workload you’re running.

What ARMS could mean for the future of memory in data centers

The take-home message from ARMS is not merely that memory tiering can be better tuned. It is that memory tiering can become robust enough to feel almost effortless in practice. If you think about the implications, a few themes stand out. First, data centers could deploy tiered memories at scale with far less manual tuning, freeing engineers to focus on higher-level questions like data locality and application design rather than micromanaging thresholds. Second, the improved efficiency translates into tangible cost savings. Faster data access, fewer migrations, and better bandwidth utilization can lower both hardware costs and energy consumption, which is a big deal as AI workloads push up memory footprints and as energy budgets tighten.

There is another, subtler consequence. As memory becomes cheaper per byte and companies push more workloads into large, multi-tiered configurations, systems like ARMS could enable more predictable performance. For latency-sensitive services—real-time analytics, interactive databases, streaming pipelines that must keep pace with live data—having an automatic, robust memory manager could be a quiet but essential driver of user experience. It is not just about moving pages around more cleverly; it is about giving software a memory system that can adapt to the moment, in real time, in the same way modern software already adapts to changing network, CPU, and storage conditions.

Of course, ARMS is not a universal cure-all. The authors acknowledge that no system is perfectly tuned for every possible workload, and practical deployments will still benefit from contextual hints in some cases. They also point to future enhancements, such as tighter integration with hardware-managed tiering and more sophisticated latency-aware policies that consider access times in addition to raw bandwidth. But the core contribution is a shift in mindset: move away from fixed thresholds and toward adaptive policies that learn from the workload itself. That is a direction with clear practical payoff and fertile ground for further innovation.

Why this matters for researchers, engineers, and curious readers alike

ARMS is a reminder that the most powerful improvements in complex systems often come from rethinking incentives and signals. Instead of forcing data into a preconceived rubric, ARMS listens to the data’s own rhythm and adapts its behavior accordingly. It is a philosophy that resonates beyond computer memory into how we design systems that must operate in dynamic, imperfect real-world environments—from autonomous vehicles negotiating unpredictable roads to AI systems adjusting to live data streams in the wild.

For researchers, ARMS offers a blueprint for building robust, tuning-free components that still deliver strong performance across diverse contexts. For engineers and data-center operators, it promises a way to extend memory capacity and reduce costs without the tail-risk of fragile, knob-heavy configurations. And for the curious reader, it is a story about how systems learned to stop fighting change and start embracing it, turning a once brittle complexity into a resilient, self-optimizing partner.

Highlights: ARMS delivers tuning-free, threshold-free memory tiering; it uses dual moving averages to identify hot data; it detects workload shifts via change-point analysis; it employs a cost-benefit migration policy and batched, bandwidth-aware migrations; it achieves near-tuned performance without manual parameter tuning across diverse workloads and hardware. The project is led by researchers at the University of Wisconsin–Madison, with Sujay Yadalam as the first author and collaborators Konstantinos Kanellis, Michael Swift, and Shivaram Venkataraman.