Rethinking AI Serving at a Massive Scale
Large language models (LLMs) like GPT and its successors have dazzled us with their ability to generate text, answer questions, and even write code. But behind the scenes, running these models at scale is a monumental engineering challenge. The latest research from Huawei’s xDeepServe team, published in 2025, reveals how they tackled this by radically reimagining how LLMs are served on their CloudMatrix384 SuperPod — a supercomputer with hundreds of AI chips interconnected at blistering speeds.
At its core, the problem is this: as LLMs grow bigger and more complex, simply throwing more hardware at them isn’t enough. The models themselves have evolved to use a Mixture-of-Experts (MoE) architecture, where only a small subset of specialized “expert” modules activate for each token processed. This design boosts efficiency but demands intricate coordination across hundreds of AI processors (NPUs). Meanwhile, Huawei’s CloudMatrix384 offers a tightly coupled environment with 384 Ascend 910C chips linked by a high-bandwidth fabric, enabling shared memory access across the entire pod.
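To make the routing idea concrete, here is a minimal NumPy sketch of top-k expert gating with toy shapes (8 tokens, 64 experts, top-2 routing); the gating function, expert count, and dimensions of the deployed models differ, and `route_tokens` is a name invented for this example.

```python
import numpy as np

def route_tokens(hidden_states, gate_weights, top_k=2):
    """Select the top_k experts per token from a learned gating projection."""
    scores = hidden_states @ gate_weights                 # [tokens, experts]
    scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)            # softmax over experts
    chosen = np.argsort(-probs, axis=-1)[:, :top_k]       # expert ids per token
    weights = np.take_along_axis(probs, chosen, axis=-1)  # gating weights
    return chosen, weights

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 16))      # 8 tokens, hidden size 16 (toy)
gates = rng.standard_normal((16, 64))      # 64 experts (toy)
expert_ids, gate_weights = route_tokens(tokens, gates)
print(expert_ids)  # each row lists the 2 experts that fire for that token
```

Only the chosen experts run for each token, which is why the hardware hosting them must be coordinated so carefully.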
Breaking the Transformer Into Pieces
Traditional transformer models are monolithic beasts, with attention and feedforward layers tightly intertwined. The xDeepServe team introduced a bold idea called Transformerless: disaggregate the transformer into modular units — attention, feedforward, and MoE experts — and run each independently on dedicated NPUs. This separation allows each component to scale on its own, reducing resource contention and improving fault isolation.
This modular approach is akin to an orchestra where each section rehearses separately but plays in perfect harmony during the concert. By decoupling compute and memory-heavy parts, the system can independently optimize each stage, leading to better utilization and lower latency.
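As a rough sketch of what that decoupling implies, the snippet below models each stage as its own pool of NPUs that can be resized independently; the `StagePool` class and the pool sizes are illustrative assumptions, not the paper's configuration.

```python
from dataclasses import dataclass

@dataclass
class StagePool:
    """A pool of NPUs dedicated to one disaggregated stage."""
    name: str
    npus: int

    def scale_to(self, npus: int) -> None:
        # Each stage resizes independently of the others.
        self.npus = npus

# Illustrative layout: attention, dense feedforward, and MoE experts each get
# their own pool instead of sharing one monolithic model replica.
attention = StagePool("attention", npus=32)
feedforward = StagePool("feedforward", npus=16)
moe_experts = StagePool("moe_experts", npus=96)

# If expert load grows, only the expert pool is enlarged; attention NPUs
# and their KV-cache memory are left untouched.
moe_experts.scale_to(128)
print(attention, feedforward, moe_experts)
```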
From Prefill to Decode: A Two-Stage Pipeline
Serving LLMs involves two main phases: prefill, where the model processes the input prompt and builds a key-value (KV) cache, and decode, where it generates output tokens step-by-step. These phases have different computational profiles — prefill is compute-bound and dynamic, decode is memory-bound and more static.
xDeepServe implements a disaggregated prefill-decode architecture, running prefill and decode on separate NPUs. Prefill runs on both older 910B and newer 910C chips, while decode exclusively uses the high-speed CloudMatrix384 910C nodes to leverage the super-fast interconnect. This design balances cost and performance, enabling rapid response times even for large MoE models.
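The handoff between the two phases can be pictured with the toy simulation below, where a prefill worker builds the KV cache once and a decode worker then generates token by token from it; the `prefill` and `decode` functions are placeholders for illustration, not the engine's real interfaces.

```python
def prefill(prompt_tokens):
    """Process the whole prompt in one compute-bound pass and return a KV cache."""
    # Placeholder: a real prefill runs the full model over the prompt on prefill NPUs.
    keys = [("k", tok) for tok in prompt_tokens]
    values = [("v", tok) for tok in prompt_tokens]
    return keys, values

def decode(kv_cache, max_new_tokens):
    """Generate tokens one at a time, extending the transferred KV cache."""
    keys, values = kv_cache
    output = []
    for step in range(max_new_tokens):
        # Placeholder next-token rule; a real decode step is memory-bound and
        # reads the whole cache on every iteration.
        next_token = f"tok{step}"
        keys.append(("k", next_token))
        values.append(("v", next_token))
        output.append(next_token)
    return output

# Prefill and decode run on different NPU pools; only the KV cache crosses over.
cache = prefill(["The", "quick", "brown", "fox"])
print(decode(cache, max_new_tokens=4))
```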
Mastering Communication with XCCL
One of the biggest hurdles in scaling LLM serving is communication overhead. The team developed XCCL, a custom communication library that exploits CloudMatrix384’s global shared memory. Unlike traditional network communication, XCCL uses memory-semantic operations to transfer data directly between NPUs with microsecond latency.
XCCL supports point-to-point transfers, as well as complex all-to-all communication patterns needed for MoE dispatch and combine operations. For example, when routing tokens to experts or aggregating expert outputs, XCCL’s efficient protocols keep data flowing smoothly without bottlenecks.
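The dispatch/combine pattern itself can be sketched in plain Python: tokens are bucketed by the rank hosting their assigned expert, processed remotely, and gathered back into their original order. The bucketing below stands in for XCCL's memory-semantic all-to-all and does not use its actual API; the expert-to-rank mapping is an assumption.

```python
from collections import defaultdict

def dispatch(tokens, expert_assignment, num_ranks):
    """Group tokens by the rank that hosts their assigned expert (all-to-all send)."""
    buckets = defaultdict(list)
    for idx, (token, expert) in enumerate(zip(tokens, expert_assignment)):
        rank = expert % num_ranks            # illustrative expert-to-rank mapping
        buckets[rank].append((idx, token))
    return buckets

def combine(buckets):
    """Gather expert outputs back into the original token order (all-to-all receive)."""
    merged = {}
    for rank, items in buckets.items():
        for idx, token in items:
            merged[idx] = f"{token}@rank{rank}"   # placeholder for the expert output
    return [merged[i] for i in sorted(merged)]

tokens = ["t0", "t1", "t2", "t3", "t4", "t5"]
assignment = [3, 0, 3, 7, 1, 0]              # expert id chosen for each token
buckets = dispatch(tokens, assignment, num_ranks=4)
print(combine(buckets))
```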
Balancing the Load of Experts
MoE models route each token to only a few experts, but uneven routing can leave some experts overloaded while others sit idle, causing slowdowns. xDeepServe’s Expert Placement Load Balancing (EPLB) algorithm continuously monitors token distribution, identifies “hot” experts, and replicates them across multiple NPUs. Tokens are then spread evenly among the replicas, smoothing out the workload and reducing latency by over 40% in tests.
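A stripped-down version of that rebalancing loop might look like the sketch below: count tokens per expert, give the hottest experts extra replicas, and split their traffic round-robin across the copies. The threshold and replication rule here are illustrative, not the published algorithm.

```python
from collections import Counter

def rebalance(expert_assignment, hot_threshold):
    """Replicate experts whose token counts exceed hot_threshold and split traffic."""
    load = Counter(expert_assignment)
    replicas = {e: max(1, -(-count // hot_threshold))     # ceiling division
                for e, count in load.items()}
    seen = Counter()
    routed = []
    for expert in expert_assignment:
        # Round-robin each expert's tokens across its replicas.
        replica = seen[expert] % replicas[expert]
        seen[expert] += 1
        routed.append((expert, replica))
    return replicas, routed

assignment = [0, 0, 0, 0, 0, 0, 1, 2, 2, 3]   # expert 0 is "hot"
replicas, routed = rebalance(assignment, hot_threshold=3)
print(replicas)   # {0: 2, 1: 1, 2: 1, 3: 1} -> expert 0 gets a second copy
print(routed)     # expert 0's tokens alternate between replica 0 and replica 1
```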
Decentralizing Serving with FlowServe
To scale across hundreds of NPUs, the team redesigned their serving engine, FlowServe, into a decentralized system. Instead of a central scheduler, FlowServe organizes NPUs into Data Parallel (DP) groups, each managing its own pipeline independently. This eliminates single points of failure and bottlenecks, allowing the system to handle massive workloads with low latency.
FlowServe also incorporates smart scheduling, proactive garbage collection to reduce jitter, and multi-token prediction (MTP) to speculate multiple tokens ahead, boosting throughput without sacrificing accuracy.
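Conceptually, the decentralized design means every DP group owns its request queue and scheduling loop, as in the hedged sketch below; the `DPGroup` class, group count, and round-robin front door are invented for illustration and are not FlowServe's actual code.

```python
import queue
import threading

class DPGroup:
    """One data-parallel group with its own request queue and scheduler loop."""

    def __init__(self, group_id):
        self.group_id = group_id
        self.requests = queue.Queue()
        self.completed = []

    def serve_forever(self, stop_event):
        # No central scheduler: each group drains its own queue independently.
        while not stop_event.is_set() or not self.requests.empty():
            try:
                prompt = self.requests.get(timeout=0.05)
            except queue.Empty:
                continue
            self.completed.append(f"group{self.group_id}:{prompt}")

stop = threading.Event()
groups = [DPGroup(i) for i in range(4)]
threads = [threading.Thread(target=g.serve_forever, args=(stop,)) for g in groups]
for t in threads:
    t.start()

# Requests are spread across groups (here: simple round-robin at the front door).
for i, prompt in enumerate(["p0", "p1", "p2", "p3", "p4", "p5"]):
    groups[i % len(groups)].requests.put(prompt)

stop.set()
for t in threads:
    t.join()
print([g.completed for g in groups])
```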
Disaggregated MoE and Attention: A New Frontier
Taking disaggregation further, xDeepServe separates MoE experts and attention computations onto different NPUs. This introduces asymmetry — more NPUs for experts than attention — which complicates communication. The team’s clever trampoline forwarding technique uses a subset of expert NPUs as intermediaries to balance traffic and reduce metadata overhead.
They also run persistent kernels on MoE NPUs with concurrent streams for receiving data, computing, and sending results, avoiding costly CPU interactions and maintaining microsecond-level scheduling granularity.
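The trampoline idea can be reduced to the small routing sketch below: an attention rank sends each payload to one fixed intermediary in the expert pool, which forwards it to the final expert rank, so the attention side tracks metadata per intermediary rather than per expert NPU. The rank counts and forwarding rule are assumptions for illustration.

```python
def trampoline_route(attn_rank, target_expert_rank, num_attn_ranks, num_expert_ranks):
    """Return the two-hop path from an attention rank to an expert rank.

    Each attention rank talks to one fixed intermediary expert rank (hop 1),
    which then forwards within the expert pool (hop 2).
    """
    # Illustrative rule: spread attention ranks evenly over the expert pool.
    ranks_per_intermediary = num_expert_ranks // num_attn_ranks
    intermediary = attn_rank * ranks_per_intermediary
    if intermediary == target_expert_rank:
        return [attn_rank, target_expert_rank]          # direct hit, one hop
    return [attn_rank, intermediary, target_expert_rank]

# Asymmetric layout: 4 attention ranks feeding 16 expert ranks (toy numbers).
for expert_rank in (0, 5, 13):
    print(trampoline_route(attn_rank=1, target_expert_rank=expert_rank,
                           num_attn_ranks=4, num_expert_ranks=16))
```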
Reliability at Scale
With hundreds of NPUs working in concert, failures are inevitable. xDeepServe employs multi-tiered heartbeat mechanisms and link probing to detect faults quickly. Their recovery strategies evolved from coarse cluster restarts to fine-grained component failover and token recomputation, ensuring the system stays online and responsive even amid hardware glitches.
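A toy version of such a heartbeat monitor is sketched below: components report timestamps, and anything silent past a timeout is flagged for failover. The real system's tiers, timeouts, and recovery actions are far more elaborate than this sketch.

```python
import time

class HeartbeatMonitor:
    """Flag components whose last heartbeat is older than the timeout."""

    def __init__(self, timeout_s=0.5):
        self.timeout_s = timeout_s
        self.last_seen = {}

    def beat(self, component):
        self.last_seen[component] = time.monotonic()

    def failed_components(self):
        now = time.monotonic()
        return [c for c, t in self.last_seen.items() if now - t > self.timeout_s]

monitor = HeartbeatMonitor(timeout_s=0.2)
for component in ("attention-0", "moe-3", "decode-7"):
    monitor.beat(component)

time.sleep(0.3)
monitor.beat("attention-0")            # only this component keeps beating

# "moe-3" and "decode-7" have gone silent and would trigger fine-grained
# failover (for example, re-routing their work and recomputing in-flight tokens).
print(monitor.failed_components())
```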
Performance That Pushes Boundaries
Deployed in production, xDeepServe powers large-scale DeepSeek, Kimi, and Qwen models. It achieves an impressive 2400 tokens per second per Ascend 910C chip while meeting a tight 50 ms time-per-output-token SLA. The system handles input sequences up to 96,000 tokens and output lengths up to 32,000 tokens, isolating long-sequence workloads to prevent interference.
Why This Matters
xDeepServe’s innovations highlight a crucial shift in AI infrastructure: as models grow, so must the sophistication of the systems that serve them. By breaking transformers into independently scalable pieces and leveraging ultra-fast shared memory fabrics, Huawei’s team demonstrates a path forward for serving ever-larger LLMs efficiently and reliably.
This work is a testament to the power of co-designing hardware and software — where architectural choices in chips, networks, and algorithms come together to unlock new levels of AI performance. As LLMs continue to scale, such disaggregated, decentralized serving systems will likely become the backbone of next-generation AI services.
Research institution: Huawei Cloud xDeepServe Team
Lead contributors: Ao Xiao, Bangzheng He, Baoquan Zhang, Baoxing Huai, Bingji Wang, and many others