
Processing 200,000 tokens through a large language model is expensive and slow: the longer the context, the higher the cost. Researchers at Tsinghua University and Z.ai built a technique called IndexCache that eliminates up to 75% of redundant computation in sparse attention models, delivering up to 1.82x faster time to first token and 1.48x faster generation throughput at long context lengths.
The technique applies to models built on the DeepSeek Sparse Attention architecture, including the latest DeepSeek and GLM model families. It can help enterprises deliver a faster user experience for production-scale, long-context models, a capability already demonstrated in initial testing on a 744-billion-parameter GLM-5 model.
The DSA bottleneck
Large language models rely on a self-attention mechanism, in which the model computes the relationship between each token in its context and every previous token in order to predict the next token.
However, self-attention has a serious limitation: its computational complexity scales quadratically with sequence length. For applications that require extended context windows, such as large-document processing, multi-step agent workflows, or long chain-of-thought reasoning, this quadratic scaling leads to slow inference and significant compute and memory costs.
Sparse attention offers a principled solution to this scaling problem. Rather than computing the relationship between every token and all of its predecessors, sparse attention streamlines the process by attending, for each query, only to the most relevant subset of tokens.
DeepSeek Sparse Attention (DSA) is a highly efficient implementation of this idea, first introduced in DeepSeek-V3.2. DSA adds a lightweight "lightning indexer" module to each layer of the model to determine which tokens matter most. This indexer scores all previous tokens and selects a small subset for the main attention mechanism to process. By reducing the heavy attention computation from quadratic to linear, DSA dramatically speeds up the model while maintaining output quality.
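The core idea of an indexer can be sketched in a few lines. The snippet below is a deliberately simplified illustration, not the actual DSA lightning indexer (which is a small learned module): it scores previous tokens against the current query with a plain dot product and keeps only the top-k positions for attention to process.

```python
import numpy as np

def select_top_k_tokens(query, keys, k):
    """Score every previous token against the current query and keep the
    k highest-scoring ones -- the core idea behind a lightweight indexer
    in sparse attention (simplified; DSA uses a learned indexer, not raw
    dot products)."""
    scores = keys @ query            # one relevance score per previous token
    top_k = np.argsort(scores)[-k:]  # positions of the k best-scoring tokens
    return np.sort(top_k)            # sorted token positions to attend over

rng = np.random.default_rng(0)
query = rng.standard_normal(64)
keys = rng.standard_normal((1000, 64))  # 1,000 previous tokens
selected = select_top_k_tokens(query, keys, k=128)
print(len(selected))  # 128 -- attention now runs over 128 tokens, not 1,000
```

The main attention step then operates only on the 128 selected positions, which is what turns the quadratic cost linear.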
But the researchers identified a lingering flaw: the DSA indexer itself still operates at quadratic complexity in every layer. Although the indexer is computationally cheaper than the main attention operation, the time the model spends running these indexers grows quadratically as the context length increases. This slows the model significantly, especially during the "prefill" stage, when the request is first processed.
Cross-layer caching with IndexCache
To break the indexing bottleneck, the research team identified an important property of how DSA models process data: the subset of important tokens selected by the indexer remains fairly stable as data moves through successive transformer layers. Empirical tests on DSA models showed that adjacent layers share between 70% and 100% of their selected tokens.
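That overlap is straightforward to quantify. Here is a minimal sketch, using made-up index sets for two adjacent layers, of the cross-layer similarity measurement the finding rests on:

```python
def index_overlap(layer_a, layer_b):
    """Fraction of layer_a's selected token positions that layer_b also
    selected -- the cross-layer redundancy IndexCache exploits. The index
    lists here are hypothetical; in practice they would come from each
    layer's indexer output."""
    a, b = set(layer_a), set(layer_b)
    return len(a & b) / len(a)

# Two adjacent layers picking mostly the same tokens:
layer_3 = [2, 5, 9, 14, 21, 33, 40, 57]
layer_4 = [2, 5, 9, 14, 21, 33, 41, 57]
print(index_overlap(layer_3, layer_4))  # 0.875 -- 7 of 8 positions shared
```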
The researchers developed IndexCache to exploit this cross-layer redundancy. The technique divides the model's layers into two categories. A small number of full (F) layers keep their indexers, actively scoring tokens and caching the indices of the most important ones. The remaining shared (S) layers perform no indexing at all, reusing the cached indices from the nearest preceding F layer.
During inference, the model simply checks each layer's type. If it is an F layer, it computes and caches fresh indices. If it is an S layer, it skips the computation and reads the cached indices.
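The dispatch amounts to a single branch per layer. The sketch below uses illustrative names (the real logic lives inside the serving engine's attention code), with a fake indexer that records how often it actually runs:

```python
def layer_indices(layer_id, layer_types, index_cache, compute_indices, hidden):
    """Return the token indices a layer should attend to. F layers run
    their indexer and refresh the cache; S layers reuse the cached
    indices from the nearest preceding F layer. (Illustrative sketch;
    all names are hypothetical.)"""
    if layer_types[layer_id] == "F":
        index_cache["indices"] = compute_indices(layer_id, hidden)  # fresh top-k
    return index_cache["indices"]  # S layers fall through to the cached copy

calls = []
def fake_indexer(layer_id, hidden):
    calls.append(layer_id)          # record each indexer invocation
    return [1, 4, 7]                # stand-in for real top-k indices

types = {0: "F", 1: "S", 2: "S", 3: "F"}
cache = {}
for layer in range(4):
    idx = layer_indices(layer, types, cache, fake_indexer, hidden=None)
print(calls)  # [0, 3] -- only the two F layers ran their indexer
```

With one F layer serving every three S layers, three quarters of the indexer invocations disappear, which is where the 75% figure comes from.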
A wide variety of optimization techniques attack the attention bottleneck through KV cache compression, which shrinks the stored key-value states that attention computes over. IndexCache instead targets the computational bottleneck rather than the memory footprint.
“IndexCache is not a traditional KV cache compression or sharing technique,” paper co-author Yushi Bai told VentureBeat. “It removes this redundancy by reusing indexes between layers, reducing not just the memory footprint but the computation. It complements and can be combined with existing approaches.”
The researchers developed two deployment approaches for IndexCache. (Note that IndexCache only applies to models that use the DSA architecture, such as the latest DeepSeek models and the latest family of GLM models.)
For developers working with off-the-shelf DSA models, where retraining is impossible or too expensive, they created a training-free method based on a "greedy layer selection" algorithm. By running a small calibration dataset through the model, this algorithm automatically determines the optimal placement of F and S layers without any weight updates. Empirical evidence shows that the greedy algorithm can safely remove 75% of the indexers while matching the downstream performance of the original model.
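A greedy search of this kind can be sketched as follows. The `spacing_score` calibration proxy is entirely made up for illustration; in practice the score would come from running calibration data through the model and measuring output quality:

```python
def greedy_layer_selection(num_layers, num_full, score_fn):
    """Greedily pick which layers keep their indexers (F layers). Start
    with only layer 0 as F, then repeatedly promote whichever S layer
    most improves the calibration score until the F budget is spent.
    (Sketch of the greedy idea, not the paper's exact algorithm.)"""
    f_layers = {0}  # the first layer has no earlier cache to reuse
    while len(f_layers) < num_full:
        candidates = [l for l in range(num_layers) if l not in f_layers]
        best = max(candidates, key=lambda l: score_fn(f_layers | {l}))
        f_layers.add(best)
    return sorted(f_layers)

def spacing_score(f_layers):
    # Toy calibration proxy: prefer configurations that keep every layer
    # close to its nearest preceding F layer (fresher cached indices).
    fs = sorted(f_layers)
    total = 0
    for layer in range(12):
        nearest = max(f for f in fs if f <= layer)
        total -= layer - nearest
    return total

# Keep 3 of 12 indexers (75% removed):
print(greedy_layer_selection(12, 3, spacing_score))  # [0, 3, 6]
```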
For teams that pre-train or heavily fine-tune their own base models, the researchers propose a training-aware version that optimizes network parameters to natively support cross-layer sharing. This approach introduces a "multi-layer distillation loss" during training, which forces each retained indexer to learn to select a consensus subset of tokens that stays highly relevant for all the subsequent layers it serves.
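One way to picture such a loss: treat each served layer's own indexer scores as a teacher and push the retained F indexer's score distribution toward all of them at once. This is an illustrative cross-entropy formulation, not the paper's exact objective:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multilayer_distill_loss(f_scores, served_layer_scores):
    """Toy multi-layer distillation objective: the retained F indexer's
    token-score distribution is averaged-cross-entropy-matched against
    every S layer it serves, so its selections stay relevant for all of
    them. (Illustrative formulation; the paper's loss may differ.)"""
    p = softmax(f_scores)
    loss = 0.0
    for teacher in served_layer_scores:
        q = softmax(teacher)
        loss += -np.sum(q * np.log(p + 1e-9))  # cross-entropy vs this served layer
    return loss / len(served_layer_scores)

f = np.array([3.0, 1.0, 0.1, -1.0])
aligned = [np.array([2.9, 1.1, 0.0, -1.2])]    # served layer agrees with F
misaligned = [np.array([-1.0, 0.1, 1.0, 3.0])]  # served layer disagrees
print(multilayer_distill_loss(f, aligned) < multilayer_distill_loss(f, misaligned))  # True
```

Training against all served layers jointly is what makes it safe for those layers to drop their own indexers afterward.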
Real-world acceleration in production models
To test the effect of IndexCache, the researchers applied it to the 30-billion-parameter GLM-4.7 Flash model and compared it against the standard baseline.
Removing 75% of the indexers at a context length of 200K cut prefill latency from 19.5 seconds to just 10.7 seconds, a 1.82x speedup. These gains are expected to grow even larger at longer context lengths, the researchers note.
During the decoding phase, when the model generates its response, IndexCache increased per-query throughput by 1.48x, from 58 tokens per second to 86 tokens per second at 200K context tokens. When the server's memory was completely filled with requests, overall decoding throughput increased by 51%.
For enterprise teams, these efficiency gains translate directly into cost savings. "In terms of ROI, IndexCache provides consistent benefits across scenarios, but the gains are most noticeable in long-context workloads such as RAG, document analysis, and agent pipelines," said Bai. "In these cases, we see at least about a 20% reduction in deployment cost and similar improvements in user-perceived latency." He added that for very short-context tasks, the benefits are around 5%.
Notably, these efficiency gains did not compromise reasoning ability. Using the training-free approach to remove 75% of the indexers, the 30B model matched the original baseline's average score on long-context benchmarks, scoring 49.9 versus the original's 50.2. On the highly complex AIME 2025 math reasoning benchmark, the optimized model outperformed the original baseline, scoring 92.6 compared to 91.0.
The team also ran initial experiments on a production-scale, 744-billion-parameter GLM-5 model. Removing 75% of its indexers with the training-free method yielded at least a 1.3x speedup on contexts over 100K tokens, while the model maintained nearly the same average quality on long-context tasks.
Putting IndexCache into production
For development teams looking to adopt the training-free approach today, the process is simple but requires careful setup. Although the greedy search algorithm automatically finds the optimal layer configuration, the quality of that configuration depends on the data it is calibrated on.
“We recommend using domain-specific data as a calibration set to adapt the discovered layer sharing pattern to real workloads,” Bai said.
Once calibrated, the optimization is ready for production environments: open-source patches for major serving engines are now available on GitHub. "Integration is relatively simple – developers can apply the patch to existing inference stacks such as vLLM or SGLang and enable IndexCache with minimal configuration changes," Bai said.
While IndexCache immediately addresses today’s computational challenges, its underlying philosophy points to a broader shift in the way the AI industry approaches model design.
“Future foundation models will likely be built with downstream constraints in mind from the start,” Bai said. “This means designs that are not only scalable in terms of model size, but also optimized for real-world throughput and latency, rather than treating them as post-hoc concerns.”




