How IndexCache Cuts DeepSeek’s Sparse Attention Costs by 75%

DeepSeek Sparse Attention (DSA) promised to cut attention costs by attending only to the most relevant tokens instead of the entire context. Yet behind this breakthrough lurked an overlooked bottleneck: for a sequence of length L, the lightweight indexer that selects these tokens still performs O(L²) work at every layer. Across N transformer layers, this adds up to O(NL²), eroding much of DSA’s gain at long context lengths.

A new paper from Tsinghua University and Z.ai introduces IndexCache, a simple yet effective fix that eliminates up to 75% of this indexer overhead without altering model architecture or increasing memory usage.

The Hidden Cost of Sparse Attention

DSA reduces attention from O(L²) per layer to O(Lk) by selecting only the top-k tokens for actual attention computation. However, the indexer responsible for identifying those tokens must evaluate every prior token in the sequence, resulting in O(L²) complexity at each of the N layers. While cheaper per operation than full attention, running this across all layers still creates a significant computational burden.
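
To put rough numbers on this imbalance, the sketch below counts how many query-key pairs the indexer scores versus how many the sparse attention step reads during a causal pass. The function name and the example values are illustrative assumptions, not figures from the paper, and pair counts are not FLOPs: each indexer score is much cheaper than an attention interaction.

```python
def dsa_pair_counts(seq_len: int, top_k: int, num_layers: int) -> dict:
    """Count query-key pairs touched in one causal pass (pairs, not FLOPs)."""
    # The indexer scores every query against itself and all earlier tokens.
    indexer_per_layer = seq_len * (seq_len + 1) // 2

    # Sparse attention reads at most top_k selected tokens per query.
    # Queries at positions 1..top_k can only see as many tokens as exist.
    if seq_len <= top_k:
        attention_per_layer = seq_len * (seq_len + 1) // 2
    else:
        attention_per_layer = top_k * (top_k + 1) // 2 + (seq_len - top_k) * top_k

    return {
        "indexer": num_layers * indexer_per_layer,
        "attention": num_layers * attention_per_layer,
    }


# Illustrative values only (assumed, not taken from the paper).
counts = dsa_pair_counts(seq_len=131072, top_k=2048, num_layers=60)
ratio = counts["indexer"] / counts["attention"]
```

With these assumed values the indexer touches roughly 32 times as many pairs as the attention step, which is why a cheap per-pair score can still dominate once the context is long enough.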

"The indexer is cheap per-FLOP but still runs at every single layer, turning it into the dominant cost at long context lengths," the researchers note.

Reusing Index Results Across Layers

The breakthrough insight came from observing that adjacent transformer layers often select nearly identical token sets. Measurements show 70–100% overlap in top-k selections between neighboring layers, with even stronger consistency within layer clusters. This redundancy points to unnecessary repeated computation.
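
A minimal sketch of how such an overlap could be measured for one query token is shown below. The article does not spell out the paper's exact metric, so the definition here (shared positions divided by the size of the first set) and both function names are assumptions.

```python
def topk_overlap(selected_a, selected_b) -> float:
    """Fraction of the positions selected by layer A that layer B also selects."""
    set_a, set_b = set(selected_a), set(selected_b)
    if not set_a:
        return 1.0  # Nothing selected, so nothing can disagree.
    return len(set_a & set_b) / len(set_a)


def adjacent_overlaps(per_layer_selections):
    """Overlap between each layer and the next, in layer order."""
    pairs = zip(per_layer_selections, per_layer_selections[1:])
    return [topk_overlap(a, b) for a, b in pairs]
```

For example, `topk_overlap([1, 2, 3, 4], [2, 3, 4, 5])` gives 0.75, since three of the four selected positions are shared.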

IndexCache introduces a straightforward two-role system:

Full layers run the indexer normally and cache the resulting top-k token set.
Shared layers skip the indexer entirely, reusing the most recent cache instead.

The first layer always acts as a full layer to initialize the cache. Each subsequent layer takes the full or shared role assigned to it by a pre-determined pattern. Only full layers still pay the O(L²) indexer cost, so keeping a quarter of the layers full removes 75% of the O(NL²) indexer work; a shared layer simply reads the cached top-k indices.

The inference loop transforms from:

for layer in range(1, N + 1):
    index = indexer(layer, tokens)
    top_k = select_top_k(index)
    tokens = sparse_attention(tokens, top_k)
    tokens = feed_forward(tokens)

to a more efficient version:

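cache = None  # Top-k indices from the most recent full layer
for layer in range(1, N + 1):
    if is_full_layer(layer):  # Must be True for layer 1 so the cache is set
        index = indexer(layer, tokens)
        top_k = select_top_k(index)
        cache = top_k  # Overwrite the cache with the new result
    else:
        top_k = cache  # Reuse cached result; the indexer is skipped
    tokens = sparse_attention(tokens, top_k)
    tokens = feed_forward(tokens)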

This approach adds no extra GPU memory beyond what standard DSA requires: the cache holds only the current top-k index tensor, which is overwritten at each full layer.
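
To make the saving concrete, here is a small runnable sketch of the same control flow with the model stripped out. It only tracks how often the indexer would run and which full layer each layer's indices come from. The function name `simulate_index_cache` and the example pattern are hypothetical.

```python
def simulate_index_cache(num_layers: int, full_layers: set):
    """Return (indexer_calls, source layer of the indices used at each layer)."""
    if 1 not in full_layers:
        raise ValueError("layer 1 must be full so the cache is initialized")

    cache_source = None  # Single cache slot, overwritten by each full layer.
    indexer_calls = 0
    sources = []
    for layer in range(1, num_layers + 1):
        if layer in full_layers:
            indexer_calls += 1    # Stand-in for running the indexer and top-k.
            cache_source = layer  # The new result replaces the old one.
        sources.append(cache_source)  # Shared layers reuse the cached slot.
    return indexer_calls, sources


# Hypothetical pattern: 8 layers, with layers 1 and 5 kept full.
calls, sources = simulate_index_cache(8, {1, 5})
```

Here the indexer runs at 2 of 8 layers, a 75% reduction, and only one cached result exists at any time, which matches the memory claim above.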

How the Indexer Works

The actual indexer mechanism remains unchanged from DSA’s design, focusing on efficiency through a lightweight scoring system:

Each query token gets scored against all candidate positions using a compressed representation from Multi-head Latent Attention (MLA).
Scores are passed through a ReLU activation to filter out negative values.
The top-k positions are selected without full softmax normalization, reducing computational overhead.

MLA plays a crucial role here by compressing multi-head key-value pairs into a single low-rank latent vector, enabling the indexer to operate on a much smaller representation than the full attention mechanism.
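
The scoring steps above can be sketched for a single query token as follows. This is a simplified illustration, not DSA's actual indexer: it assumes one scoring head, plain Python lists standing in for the compressed vectors, and a tie-break rule (earlier positions win) chosen only to make the output deterministic. The caller is assumed to pass only the candidates the query is allowed to see.

```python
import heapq


def select_top_k(query, keys, k):
    """Return the k candidate positions with the highest ReLU'd dot-product score."""
    scored = []
    for pos, key in enumerate(keys):
        dot = sum(q * x for q, x in zip(query, key))
        scored.append((max(dot, 0.0), pos))  # ReLU clips negative scores to zero.

    # Raw scores are ranked directly. Softmax is monotonic, so normalizing
    # first would not change which positions come out on top.
    best = heapq.nlargest(k, scored, key=lambda item: (item[0], -item[1]))
    return sorted(pos for _, pos in best)


# Toy 2-dimensional vectors, chosen by hand for illustration.
picked = select_top_k([1.0, 0.0], [[0.5, 0.0], [-1.0, 0.0], [2.0, 0.0], [1.0, 0.0]], k=2)
```

In this toy case the scores are 0.5, 0, 2 and 1, so positions 2 and 3 are selected.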

Determining Full vs. Shared Layer Patterns

The paper explores two approaches to identify which layers should remain full versus shared:

Training-Free Greedy Search

This method requires no model retraining and works entirely through evaluation:

Start with all layers marked as full.
For each remaining full layer after the first, tentatively convert it to shared and measure the language-model loss.
Commit the conversion that causes the smallest loss increase.
Repeat until reaching the target number of shared layers (e.g., converting 75% of layers).

The approach evaluates every candidate on the same small calibration set drawn from training batches, so loss differences reflect the layer change rather than variation in the data. Empirical results on a 30-billion-parameter DSA model show:

The greedy pattern consistently outperforms uniform interleaving at the same retention ratio.
The loss curve reveals a clear pattern: the first conversions add little loss, while later conversions show steeper loss increases.
The identified layer importance ranking remains stable across different calibration datasets.
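
The greedy procedure described in this section can be sketched as below. The callback `loss_fn` is a hypothetical stand-in for running the model on the calibration set with a given set of shared layers, and the function name is likewise an assumption. Layer 1 is excluded from the candidates because it must stay full to initialize the cache.

```python
def greedy_shared_layers(num_layers, num_shared, loss_fn):
    """Pick layers to share one at a time, always taking the cheapest conversion.

    loss_fn(shared_layers) must return the calibration loss when the given
    set of layers reuses cached indices. Returns layers in conversion order.
    """
    if num_shared > num_layers - 1:
        raise ValueError("layer 1 must stay full, so at most num_layers - 1 can be shared")

    shared = set()
    order = []
    while len(shared) < num_shared:
        # Every layer that is still full, except layer 1, is a candidate.
        candidates = [l for l in range(2, num_layers + 1) if l not in shared]
        # Evaluate each tentative conversion and keep the lowest-loss one.
        best = min(candidates, key=lambda l: loss_fn(shared | {l}))
        shared.add(best)
        order.append(best)
    return order


# Toy loss with made-up per-layer penalties, standing in for a real evaluation.
penalty = {2: 0.5, 3: 0.1, 4: 0.3, 5: 0.05}
toy_loss = lambda layers: 2.0 + sum(penalty[l] for l in layers)
order = greedy_shared_layers(num_layers=5, num_shared=3, loss_fn=toy_loss)
```

Each round costs one loss evaluation per remaining candidate, so the search needs on the order of N² evaluations in total but no gradient updates. With the toy penalties above, layers 5, 3 and 4 are converted in that order.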

Why Simple Patterns Fail

A naive uniform alternation (e.g., full-shared-shared-shared repeating) performs poorly because indexer importance varies significantly across layers. Critical early and transitional layers cannot be shared without substantial accuracy loss, while many deeper layers tolerate caching well. A fixed pattern cannot account for this intrinsic layer sensitivity.
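
A toy comparison illustrates the argument. It assumes each layer has a fixed, independent sensitivity and that sharing a layer simply costs that amount. Real loss also depends on which cached indices are reused, so this is a deliberate simplification. The sensitivity values are invented for illustration, and both function names are hypothetical.

```python
def uniform_full_layers(num_layers, period):
    """Full layers for a repeating pattern: one full layer, then period - 1 shared."""
    return {layer for layer in range(1, num_layers + 1) if (layer - 1) % period == 0}


def sharing_cost(full_layers, sensitivity):
    """Toy cost: the summed sensitivity of every layer that is shared."""
    return sum(s for layer, s in enumerate(sensitivity, start=1) if layer not in full_layers)


# Made-up sensitivities for 8 layers; layers 1, 3 and 7 matter most.
sensitivity = [9, 1, 6, 1, 1, 1, 7, 1]

uniform = uniform_full_layers(8, period=4)     # Keeps layers 1 and 5 full.
uniform_cost = sharing_cost(uniform, sensitivity)
aware_cost = sharing_cost({1, 7}, sensitivity)  # Same budget, sensitivity-aware.
```

Under these assumed numbers the uniform pattern costs 17 while keeping layers 1 and 7 full costs 11, because the uniform pattern spends its second full slot on a layer that tolerates sharing.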

Practical Implications and Future Directions

IndexCache demonstrates how careful analysis of computational bottlenecks can yield substantial efficiency gains with minimal architectural changes. By eliminating redundant indexer operations, the technique enables longer context processing without proportional increases in compute costs.

Looking ahead, researchers may explore dynamic patterns that adapt to input context or integrate IndexCache principles into other sparse attention variants. The method’s simplicity suggests potential for widespread adoption across transformer architectures where attention sparsity plays a key role.
