Context compression finally works in production: new study cuts LLM input by 16x without tanking accuracy

Context windows are becoming a computational bottleneck. The longer an agent runs, the more tokens it accumulates from fetched documents, reasoning traces, and chat history, and the more memory and compute that growing context requires. Most existing solutions either reduce model accuracy, require the full context to be loaded before compression can begin, or produce memory savings that do not translate into real speedups on standard serving infrastructure.

A research team from NYU, Columbia, Princeton, University of Maryland, Harvard and Lawrence Livermore National Laboratory published a paper this week that proposes a new fix. The researchers introduce Latent Context Language Models, or LCLMs, a family of encoder-decoder models that compress the input context before it reaches the decoder. The models are open source on Hugging Face.

Unlike KV cache compression techniques (the dominant approach in the field, which still materializes the full KV cache before evicting entries), LCLMs compress the input token sequence before the decoder's prefill, so higher compression ratios directly reduce computation and memory on the decoder side. The paper reports that LCLMs run 8.8x faster than KV cache baselines on the RULER long-context benchmark at 16x compression.

"These ballooning contexts take up memory and compute and become a computational bottleneck for LLMs," Micah Goldblum, Columbia University researcher and lead adviser on the project, told VentureBeat. "Our goal was to train end-to-end language models that can handle very long contexts efficiently and accurately. If you can do language modeling like this, everything becomes cheaper and faster."

What LCLMs can do

LCLMs let models process longer contexts than would otherwise be practical, at a fraction of the memory and computational cost and without the accuracy loss that makes most compression techniques a poor substitute in production.

At 4x compression, the paper reports 91.76% accuracy on the RULER benchmark, compared to 94.41% without any compression. That is a drop of less than 3 points for shrinking the context to a quarter of its original size. Accuracy fell to 75.06% at 16x compression, where 93.75% of input tokens are removed. Every KV cache method tested at the same compression ratio scored lower.
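
The relationship between compression ratio and tokens removed is simple arithmetic. The short Python sketch below reproduces it using only the figures quoted above; the function and variable names are illustrative and do not come from the paper.

```python
def fraction_removed(compression_ratio: float) -> float:
    """Share of input positions eliminated at a given compression ratio."""
    return 1.0 - 1.0 / compression_ratio

# RULER accuracy figures as quoted in the article.
BASELINE_ACC = 94.41
ACC_AT = {4: 91.76, 16: 75.06}

for ratio, acc in ACC_AT.items():
    removed = fraction_removed(ratio) * 100
    drop = BASELINE_ACC - acc
    print(f"{ratio}x: {removed:.2f}% of tokens removed, {drop:.2f}-point accuracy drop")
```

At 4x the decoder sees a quarter of the original positions, and at 16x it sees one sixteenth, which is where the 93.75% figure comes from.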

The gains also hold on shorter inputs. On GSM8K math word problems, where the full query was compressed rather than only the retrieved documents, LCLMs outperformed all other methods tested at every compression ratio.

How it is built

The architecture pairs a 0.6B encoder with a 4B decoder. The encoder compresses blocks of input tokens into shorter sequences of latent embeddings. The decoder processes those in place of the original tokens. The models were trained on over 350 billion tokens.
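
To make the shapes concrete, here is a toy Python sketch of block-wise compression. It is not the paper's encoder: mean pooling stands in for the learned model, and `block_size`, `ratio`, and the function name are assumptions chosen for illustration.

```python
from typing import List

Vector = List[float]

def compress_blocks(embeddings: List[Vector], block_size: int, ratio: int) -> List[Vector]:
    """Toy stand-in for an encoder.

    Splits the sequence into blocks, then mean-pools every `ratio` token
    vectors inside a block into one latent vector. A trained encoder learns
    this mapping; pooling is used here only to show the length reduction.
    """
    if block_size % ratio != 0:
        raise ValueError("block_size must be a multiple of ratio")
    latents: List[Vector] = []
    for start in range(0, len(embeddings), block_size):
        block = embeddings[start:start + block_size]
        for g in range(0, len(block), ratio):
            group = block[g:g + ratio]
            dim = len(group[0])
            latents.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return latents

# 64 fake two-dimensional token embeddings, compressed 16x in blocks of 32.
tokens = [[float(i), float(i) * 2] for i in range(64)]
latents = compress_blocks(tokens, block_size=32, ratio=16)
print(len(tokens), "->", len(latents))
```

The decoder would then attend over the 4 latent vectors rather than the 64 original token positions.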

The training recipe combines three ingredients:

Continued pretraining data that mixes compressed and uncompressed spans
Supervised fine-tuning data covering reasoning and long-context tasks
An auxiliary reconstruction task that forces the encoder to retain fine details

The combination addresses a tradeoff that limited previous compression work, in which preserving reconstruction accuracy came at the cost of overall task performance.

An architecture search identified the best configuration. The paper found that scaling the decoder matters more than scaling the encoder.

Where it fits into the agent stack

LCLMs are not an abstract research concept. They are designed to work with an existing stack. "You can simply swap in LCLMs for any existing LLM," Goldblum said. "When you get data such as documents and want to move them into the context of your model, simply pass those documents through LCLM’s compressor first."
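
As a rough illustration of that compress-first step, the Python sketch below tracks how many decoder positions each retrieved document would occupy. It is a sketch under stated assumptions, not the released LCLM API: `build_context`, `expand_top_k`, the whitespace token count, and the keyword-overlap relevance score are all hypothetical stand-ins. The optional `expand_top_k` argument mimics the selective expansion of useful text that the researchers describe for agents.

```python
from typing import Dict

def ceil_div(n: int, d: int) -> int:
    return -(-n // d)

def build_context(docs: Dict[str, str], query: str, ratio: int,
                  expand_top_k: int = 0) -> Dict[str, int]:
    """Return the decoder positions each document would occupy.

    Every document is compressed by `ratio`, except the `expand_top_k`
    documents sharing the most words with the query, which stay full length.
    """
    query_terms = set(query.lower().split())

    def relevance(name: str) -> int:
        # Keyword overlap is a placeholder for whatever signal an agent uses.
        return len(query_terms & set(docs[name].lower().split()))

    ranked = sorted(docs, key=relevance, reverse=True)
    expanded = set(ranked[:expand_top_k])
    positions: Dict[str, int] = {}
    for name, text in docs.items():
        n_tokens = len(text.split())  # whitespace words as a crude token count
        positions[name] = n_tokens if name in expanded else ceil_div(n_tokens, ratio)
    return positions

docs = {
    "billing": "refund policy for annual billing plans " * 20,
    "changelog": "release notes and bug fixes " * 40,
    "faq": "how to reset a password " * 32,
}
all_compressed = build_context(docs, "refund policy annual", ratio=16)
zoomed = build_context(docs, "refund policy annual", ratio=16, expand_top_k=1)
print(sum(all_compressed.values()), sum(zoomed.values()))
```

In this toy setup, 480 words shrink to 31 positions when everything is compressed, and to 143 when the single most relevant document is read in full.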

In the paper, the researchers demonstrate how to build agents that selectively expand useful text, he noted.

"Think of it as a human skimming content before zooming in on relevant details," Goldblum said.

Goldblum also cautioned that teams integrating the approach into existing agent pipelines should adjust their RAG systems accordingly.

"We have also not worked on online compression of reasoning traces," he said. "A naïve approach of periodically compressing while generating the trace might work, but that remains to be determined."

What this means for businesses

Context windows are growing faster than retrieval infrastructure can keep up, and enterprises are already spending money to fix it. VB Pulse Q1 2026 survey data from more than 100 enterprise organizations shows that intent to adopt hybrid search tripled from 10.3% in January to 33.3% in March. Search optimization ranked as the top investment priority by March, cited by 28.9% of qualified respondents.

Three things stand out for teams evaluating production readiness:

Inference scaling with context length. At 1 million tokens, uncompressed inference with standard KV cache methods runs out of memory on an H200 GPU. The paper states that at 16x compression, LCLMs remain within memory limits at that context length.
RAG pipeline integration requires tuning. Teams with existing RAG pipelines should validate compression behavior against retrieval quality metrics before deploying at scale.
Reasoning trace compression is unresolved. For agents managing long reasoning chains, context growth from a trace is a separate problem from document retrieval. Goldblum directly acknowledged the gap: the naïve approach of periodically compressing traces might work, but it hasn’t been tested.
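
A back-of-envelope estimate shows why the first point matters. The Python sketch below sizes a KV cache for a hypothetical decoder shape (36 layers, 8 key-value heads, head dimension 128, 16-bit values). Those numbers are assumptions for illustration and are not taken from the paper, and the estimate ignores model weights and activations, so the paper's own memory accounting may differ.

```python
def kv_cache_bytes(seq_len: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Approximate KV cache size for one sequence.

    The leading factor of 2 covers storing both keys and values at every layer.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * seq_len

H200_HBM_GB = 141  # on-board memory of one H200 GPU
CONFIG = dict(layers=36, kv_heads=8, head_dim=128)  # hypothetical decoder shape

full_gb = kv_cache_bytes(1_000_000, **CONFIG) / 1e9
compressed_gb = kv_cache_bytes(1_000_000 // 16, **CONFIG) / 1e9
print(f"uncompressed: {full_gb:.1f} GB, 16x compressed: {compressed_gb:.1f} GB, "
      f"GPU memory: {H200_HBM_GB} GB")
```

Under these assumptions the uncompressed cache alone exceeds the GPU's memory at 1 million tokens, while the 16x-compressed sequence needs roughly a sixteenth of that.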

Models are available at huggingface.co/latent-context and code at github.com/LeonLixyz/LCLM.

"The biggest thing our architectures do is give your model access to larger contexts, but they also open up multidimensional approaches where your model can skim through large amounts of text or code super-fast, and then only zoom in and fully read a small portion of the most useful text," Goldblum said.
