Why Dual LLM Inference Fails on APUs: DDR5 Bandwidth Limits

A weekend spent building a multi-model agent system ended abruptly when benchmarks exposed a harsh reality: dual LLM inference on AMD APUs collapses under DDR5 bandwidth constraints. The culprit isn’t GPU compute power—it’s the single shared memory pipeline handling both models.

The Hidden Math Behind MoE Models

My initial assumption that a 35-billion-parameter model would overwhelm a system was wrong. After weeks of using qwen3.6:35b as a daily coding assistant at 17.8 tokens/second, I wondered if a lightweight sidecar model could handle quick classification while the larger model managed reasoning. What I missed was the Mixture of Experts (MoE) architecture powering qwen3.6:35b—256 experts with only eight activated per token. This means each generated token activates roughly 4-5 billion parameters worth of compute, not 35 billion. The model’s per-token cost resembled that of a 4-billion-parameter dense model, making a dual-model setup seem feasible. But benchmarks proved otherwise.

Hardware Constraints of Integrated GPUs

Testing on a Minisforum UM790Pro revealed the core limitation: its AMD Ryzen 9 7940HS CPU and Radeon 780M iGPU share a single DDR5-5600 memory bus (~80 GB/s bandwidth). There’s no dedicated GDDR6 pool—just 48 GB of GPU-accessible memory carved from the same system RAM. Every operation—CPU inference, GPU inference, and KV cache storage—competes for the same bandwidth.

Ollama served as the testing framework, with four models evaluated:

qwen3.6:35b (36B MoE)
gemma4-e2b-abliterated (4.6B)
qwen3:4b-instruct (4B)
qwen2.5:1.5b (1.5B)

Forcing CPU-only inference required setting num_gpu: 0 in API calls, ensuring no layers were offloaded to the iGPU. This isolation helped quantify how memory contention impacts performance.

Benchmark Results: The Cost of Sharing

Tests measured tokens/second across solo and dual-model runs, using identical prompts. The findings were stark:

Solo Performance:

qwen3.6:35b (GPU): 17.8 tok/s
gemma4-e2b-abliterated (GPU): 42.9 tok/s
qwen3:4b-instruct (CPU): 19.6 tok/s
qwen2.5:1.5b (CPU): 53.4 tok/s

Dual-Model Performance Hits:

Both on GPU (qwen3.6:35b + gemma4-e2b):
35B GPU model dropped to 13.1 tok/s (-26%)
4.6B GPU model fell to 25.3 tok/s (-41%)

GPU + Tiny CPU (qwen3.6:35b GPU + qwen2.5:1.5b CPU):
35B GPU model: 14.9 tok/s (-16%)
1.5B CPU model: 26.2 tok/s (-51%)

GPU + Medium CPU (qwen3.6:35b GPU + gemma4-e2b CPU, num_gpu=0):
35B GPU model: 13.0 tok/s (-27%)
4.6B CPU model: 13.4 tok/s (-53%)

GPU + Large-Context CPU (qwen3.6:35b GPU + qwen3:4b-instruct CPU, num_gpu=0):
35B GPU model: 11.6 tok/s (-35%)
4B CPU model: 11.1 tok/s (-43%)

The worst-case scenario involved a 256K-context 4B model whose KV cache ballooned to 24.2 GB, saturating the shared memory bandwidth alongside the 35B model’s 32 GB GPU allocation.

The Single-Bus Bottleneck

On discrete GPUs, CPU and GPU operate over separate memory buses—DDR5 for CPU, GDDR6 for GPU—eliminating contention. APUs, however, rely on a unified memory architecture where both Zen 4 CPU cores and RDNA 3 compute units access DDR5 through a single controller. LLM inference is memory-bound; every token requires streaming model weights, and MoE models amplify this demand by reloading expert weights per token.

DDR5-5600 (96 GB) -- ~80 GB/s shared
+-------------------+
|                   |
|  CPU cores        | 780M iGPU (12 CUs)
|  (Zen 4)          |
|                   |
+-------------------+
       | SAME MEMORY CONTROLLER |

Even the "best" dual-model setup (35B GPU + 1.5B CPU) incurred a 16% performance hit on the larger model. The tiny 1.5B model barely dented bandwidth but still halved its own throughput due to the 35B model’s dominance.

Rethinking Agent Architectures for APUs

The original goal—a planner-executor agent with the 35B model handling reasoning and a small model managing tool calls—collapsed under these constraints. Memory bandwidth becomes the critical bottleneck, not compute. For APU-based systems, the trade-off is clear: either accept severe performance losses or avoid dual-model setups entirely.

Future work could explore offloading one model to a secondary system or leveraging quantization to reduce memory footprint. Until then, APU users must weigh the convenience of integrated GPUs against the performance penalties of memory contention.

AI summary

Benchmarks show dual LLM inference on AMD APUs cuts performance by over 50% due to shared DDR5 bandwidth. Discover why MoE models worsen the bottleneck.

Why Dual LLM Inference Fails on APUs: DDR5 Bandwidth Limits

The Hidden Math Behind MoE Models

Hardware Constraints of Integrated GPUs

Benchmark Results: The Cost of Sharing

The Single-Bus Bottleneck

Rethinking Agent Architectures for APUs

Comments

OpenSparrow 2.6 adds AI search, bulk editing, and 60+ shortcuts

How supply-chain trust cracks in developer tools and CI pipelines

How Analytics Powers Product Decisions Without Sacrificing Privacy