A parameter addition of 0.12% gives AI agents working memory that RAG cannot

AI agents forget. Every time a coding assistant loses track of a debugging thread, or a data analysis agent is fed context it has already processed, the team pays in latency, token costs, and brittle workflows. The fix most teams reach for, expanding the context window or adding more RAG, is increasingly expensive and still doesn’t work reliably.

To solve this, researchers from Mind Lab and several universities have proposed delta-mem, an efficient technique that compresses the model’s historical information into a dynamically updated matrix without changing the model itself. The module adds only 0.12% of the backbone model’s parameters, compared with 76.40% for a leading alternative, while outperforming it on memory-heavy benchmarks. Delta-mem allows models to continuously accumulate and reuse historical information, reducing reliance on massive context windows or complex external retrieval modules for persistent behavior.

The long-term memory problem

The traditional solution is to simply dump all the data into the model’s context window.

But as the paper’s co-author Jingdi Lei told VentureBeat, existing systems treat memory simply as a context management problem. “Either we will continue to expand the context window, or we will get more documents through RAG,” Lei said. “These approaches are useful and will remain important, but they become increasingly expensive and fragile when agents have to work on long-term, multi-stage interactions, and they don’t work like human (working) memory because they’re more like searching documents.”

In enterprise settings, the bottleneck is not just whether the model has access to history, but whether it can reuse that history efficiently, consistently, and with low latency. Standard attention mechanisms incur quadratic computational costs as sequence length increases. Furthermore, expanding the context window does not guarantee that the model will actually use the information effectively. Models often suffer from context degradation, or context decay: while they support a million tokens in theory, they become overwhelmed as more (and often conflicting) data is added.
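The quadratic cost mentioned above can be made concrete with a rough count of the query-key pairs that full self-attention scores. This is a back-of-the-envelope sketch for a single head in a single layer; it ignores causal masking (which roughly halves the count) and any optimized attention kernel, and the function name is made up for illustration.

```python
def attention_score_count(seq_len: int) -> int:
    """Query-key pairs scored by full self-attention (one head, one layer)."""
    return seq_len * seq_len


# Doubling the sequence length quadruples the number of pairwise scores.
for tokens in (1_000, 32_000, 1_000_000):
    print(f"{tokens:>9,} tokens -> {attention_score_count(tokens):,} pairwise scores")
```

A fixed-size memory state avoids this growth because its size does not depend on how much history has been seen.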

The researchers argue for memory mechanisms that can represent historical information compactly and update it dynamically during interactions. Existing solutions come with heavy trade-offs and generally fall into three paradigms:

Text memory: stores history as text in the context window; limited by window constraints and prone to information loss during compression.
External channel (RAG): encodes and retrieves history through external modules, adding latency, integration complexity, and potential incompatibility with the backbone.
Parametric: encodes memory into model weights via adapters; static after training and unable to adapt to new information during live interaction.

Inside delta-mem

To achieve a compact and dynamically updated memory, delta-mem compresses an agent’s past interactions into an “online state of associative memory” (OSAM). This state is kept as a fixed-size matrix that holds historical information while the underlying language model remains frozen.

For enterprise workflows, this translates directly into a solution to operational bottlenecks. Lei noted that a persistent coding assistant, for example, “may need to remember project conventions, recent debugging steps, user preferences, or intermediate workflow decisions.” Similarly, a data analysis agent “may need to store task state, assumptions, and previous observations while iterating over multiple tool calls.”

Instead of repeatedly retrieving and re-entering all the relevant history for these tasks, the delta-mem matrix provides a low-overhead way to carry forward useful interaction states within the forward computation of the model.

During generation, the system does not retrieve raw text segments and append them to the prompt. Instead, the current hidden state of the backbone LLM is projected into the matrix to read out stored memory. This operation extracts context-specific associative memory signals from delta-mem. These signals are then converted into numerical corrections that are applied to the model’s computations. This steers the model at inference time without changing its internal parameters.
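The read path described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the released implementation: the projection matrices `w_query` and `w_out`, the toy dimensions, and the choice to add the correction directly to the hidden vector are all assumptions made for the example.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def read_memory(state, hidden, w_query, w_out):
    """Read a correction from the memory state for one hidden vector."""
    query = matvec(w_query, hidden)   # project the hidden state into memory space
    signal = matvec(state, query)     # associative read-out: state @ query
    return matvec(w_out, signal)      # map the signal back to model width


# Toy sizes (assumed): model width 4, memory width 2.
state = [[1.0, 0.0], [0.0, 2.0]]
w_query = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
w_out = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
hidden = [1.0, 0.5, 0.0, 0.0]

correction = read_memory(state, hidden, w_query, w_out)
corrected = [h + c for h, c in zip(hidden, correction)]
print(corrected)
```

Note that no text is retrieved at any point: the only thing carried between steps is the small `state` matrix, and the backbone weights are never touched.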

After each interaction, delta-mem updates the online state using “delta rule learning.” As new information arrives, the prior state predicts the resulting attention values. The system then compares this prediction with the actual values and adjusts the memory matrix based on the discrepancy.
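The predict, compare, adjust loop described above matches the classic delta rule for an associative memory. The sketch below shows that general rule in plain Python; the `key`/`value` naming, the learning rate `beta`, and the toy sizes are assumptions for illustration and are not taken from the delta-mem code.

```python
def delta_update(state, key, value, beta=1.0):
    """One delta-rule step: state += beta * (value - state @ key) * key^T."""
    predicted = [sum(s * k for s, k in zip(row, key)) for row in state]
    error = [v - p for v, p in zip(value, predicted)]
    return [
        [s + beta * e * k for s, k in zip(row, key)]
        for row, e in zip(state, error)
    ]


empty = [[0.0, 0.0], [0.0, 0.0]]
key, value = [1.0, 0.0], [3.0, -1.0]

once = delta_update(empty, key, value)
twice = delta_update(once, key, value)   # prediction is now exact, so no change
print(once, twice == once)
```

Because the update is driven by the prediction error, writing an association the memory already holds leaves the matrix unchanged, which is what keeps repeated information from piling up.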

This update mechanism is based on the “gated delta rule.” In essence, the memory module has gates that control how much of the previous memory is kept and how much new information is written. This error correction with controlled forgetting allows the matrix to evolve over time while retaining stable historical associations that are not overwritten by short-term noise.
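A gated variant of the delta rule can be sketched by decaying the old state before applying the error correction. The version below follows the general form of the gated delta rule; the names `alpha` (retention gate) and `beta` (write strength) and the use of fixed scalar gates are assumptions, since in a trained module such gates would typically be computed from the input.

```python
def gated_delta_update(state, key, value, alpha, beta):
    """Decay the old state by alpha, then apply a delta-rule correction."""
    decayed = [[alpha * s for s in row] for row in state]
    predicted = [sum(s * k for s, k in zip(row, key)) for row in decayed]
    error = [v - p for v, p in zip(value, predicted)]
    return [
        [s + beta * e * k for s, k in zip(row, key)]
        for row, e in zip(decayed, error)
    ]


state = [[2.0, 0.0], [0.0, 2.0]]
key, value = [1.0, 0.0], [1.0, 1.0]

# alpha=0.5 keeps half of the old memory; beta=1.0 fully writes the correction.
updated = gated_delta_update(state, key, value, alpha=0.5, beta=1.0)
# alpha=0.0 forgets everything and stores only the new association.
reset = gated_delta_update(state, key, value, alpha=0.0, beta=1.0)
print(updated, reset)
```

Setting `alpha` close to 1 favors stable long-term associations, while lowering it lets the memory discard stale content more quickly.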

The researchers explored three strategies for determining when and how to update the matrix:

Token-state write: writes at the token level, capturing fine-grained changes but remaining sensitive to short-term noise.
Sequence-state write: averages the tokens within a message segment before writing, smoothing updates at the expense of some localized detail.
Multi-state write: divides memory into substates for different types of information, such as facts or task progress.
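The three write strategies above can be contrasted with a small plain-Python sketch built on a basic delta-rule update. Everything here is illustrative: the helper names, the fixed `beta`, and especially the explicit `slot` label used to pick a substate are assumptions, and the article does not say how the real system routes information between substates.

```python
def delta_update(state, key, value, beta=0.5):
    """Basic delta-rule step used by all three strategies below."""
    predicted = [sum(s * k for s, k in zip(row, key)) for row in state]
    error = [v - p for v, p in zip(value, predicted)]
    return [[s + beta * e * k for s, k in zip(row, key)]
            for row, e in zip(state, error)]


def mean(vectors):
    """Element-wise average of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def token_state_write(state, keys, values):
    """One update per token: fine-grained, but every token moves the state."""
    for key, value in zip(keys, values):
        state = delta_update(state, key, value)
    return state


def sequence_state_write(state, keys, values):
    """One update per segment, using averaged keys and values."""
    return delta_update(state, mean(keys), mean(values))


def multi_state_write(states, slot, keys, values):
    """Write a segment into one named substate, leaving the others untouched."""
    updated = dict(states)
    updated[slot] = sequence_state_write(states[slot], keys, values)
    return updated


zeros = [[0.0, 0.0], [0.0, 0.0]]
keys, values = [[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [2.0, 0.0]]

per_token = token_state_write(zeros, keys, values)
per_segment = sequence_state_write(zeros, keys, values)
slots = multi_state_write({"facts": zeros, "progress": zeros}, "facts", keys, values)
print(per_token, per_segment, slots)
```

In this toy run the token-level write moves the state twice for a two-token segment, while the segment-level write moves it once, which is the smoothing trade-off the list describes.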

Delta-mem in action

The researchers evaluated delta-mem on three LLM backbones: Qwen3-8B, Qwen3-4B-Instruct, and SmolLM3-3B. They configured the framework with a compact 8×8 matrix. The system was tested on general benchmarks including HotpotQA, GPQA-Diamond, and IFEval. It was also evaluated on memory-intensive tasks such as LoCoMo, which tests long-term conversational memory, and MemoryAgentBench, which assesses retention, retrieval, selective forgetting, and test-time learning over long-term interactions.

The framework was compared with representative methods from the three existing memory paradigms: text memory baselines (e.g. BM25 RAG, LLMLingua-2, and MemoryBank), parametric systems (Context2LoRA and MemGen), and the external channel approach MLP Memory.

According to the researchers, delta-mem outperformed the baselines. On the Qwen3-4B-Instruct backbone, the token-state write variant averaged 51.66%, easily beating the frozen vanilla backbone at 46.79% and the strongest baseline, Context2LoRA, at 44.90%. The average score on the memory-heavy MemoryAgentBench increased from 29.54% to 38.85%. Performance on the test-time learning subtask nearly doubled, from 26.14 to 50.50.

However, the most notable result is the system’s operational efficiency. The researchers tested the framework in a no-context setting where the historical text was removed from the context entirely. Even without the plain text being replayed, delta-mem successfully recovered context-related evidence in multi-hop tasks. The researchers say this shows the model can remember past interactions without having to ingest a large number of prompt tokens.

The framework also adds only 4.87 million trainable parameters, just 0.12% of the Qwen3-4B-Instruct backbone. In comparison, the MLP Memory baseline required 3 billion parameters, equal to 76.40% of the backbone size, and delivered inferior results. When context lengths reached 32,000 tokens during inference tests, the framework maintained nearly the same GPU memory footprint as a standard, unmodified model. This avoids the severe memory bloat that affects other advanced memory systems such as MemGen and MLP Memory.
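The overhead figures above can be sanity-checked with simple arithmetic. The sketch below uses only the numbers quoted in the article; the variable names are made up, and because "3 billion" is a rounded figure the two implied backbone sizes will not match exactly.

```python
DELTA_MEM_PARAMS = 4.87e6    # trainable parameters reported for delta-mem
DELTA_MEM_SHARE = 0.0012     # 0.12% of the backbone
MLP_MEMORY_PARAMS = 3e9      # "3 billion", a rounded figure
MLP_MEMORY_SHARE = 0.7640    # 76.40% of the backbone

# Backbone size implied by each (parameter count, share) pair.
implied_backbone = DELTA_MEM_PARAMS / DELTA_MEM_SHARE
implied_backbone_mlp = MLP_MEMORY_PARAMS / MLP_MEMORY_SHARE

# How many times more parameters the MLP Memory baseline adds.
overhead_ratio = MLP_MEMORY_PARAMS / DELTA_MEM_PARAMS

print(f"implied backbone (delta-mem figures):  {implied_backbone / 1e9:.2f}B")
print(f"implied backbone (MLP Memory figures): {implied_backbone_mlp / 1e9:.2f}B")
print(f"MLP Memory adds about {overhead_ratio:.0f}x more parameters")
```

Both pairs of figures point to a backbone of roughly 4 billion parameters, which is consistent with the Qwen3-4B-Instruct model named in the text.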

Different update strategies proved useful depending on the capacity of the underlying model. The sequence-state write strategy was most effective for stronger backbones such as Qwen3-8B. These more capable models use segment-level writing to smooth updates and reduce token-level noise. In contrast, the multi-state write strategy led to large performance gains for small backbones such as SmolLM3-3B. For these lower-capacity models, splitting memory across multiple states was critical to minimizing information interference.

Implementing delta-mem in the enterprise stack

The researchers have released the code for delta-mem on GitHub and the weights for their trained adapters on Hugging Face. For AI engineering teams looking to integrate the framework into an existing inference stack, the process requires minimal computing resources.

“In practice, the engineering team will start from an existing instruction-tuned backbone, connect Delta-Mem adapter modules to selected attention layers, train only the adapter parameters on domain-relevant multi-turn or long-context data… and then run inference with a memory state that is updated online during interactions,” Lei said. The bottom line is that teams don’t need a massive pretraining corpus. The training data only needs to reflect the target memory behavior, such as multi-turn dialogs, agent traces, or domain workflows where prior information should influence subsequent decisions.

Although compressing the interaction history into a fixed-size mathematical matrix creates great efficiency, it comes at a cost. Delta-mem is not a lossless replacement for plain text records or document retrieval. Because different pieces of information compete within the same bounded state, there is a risk of memory confusion.

“Delta-Mem is useful when a system needs a fast, online, continuously updated behavioral state,” Lei said. “RAG is better when the system requires accurate factual recall, citation, compliance, auditability, or access to a large external knowledge base.” Remembering a user’s work style or a multi-step reasoning trajectory is well suited to delta-mem, whereas a legal contract or medical manual should stay in a vector database and be retrieved from there.

This means that the most realistic enterprise architecture going forward is a hybrid one. Delta-mem acts as lightweight internal working memory, reducing the need to retrieve or replay everything every time, while RAG serves as an explicit, high-capacity memory layer.
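One way to picture the hybrid split is a small routing rule that sends a request to retrieval only when it needs the properties Lei attributes to RAG. This is a hypothetical sketch: the `Request` fields, the `choose_memory_layer` function, and the two layer names are invented for illustration and do not come from the delta-mem project.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical flags describing what an agent step requires."""
    needs_citation: bool = False       # answer must point to a source
    needs_exact_recall: bool = False   # verbatim facts, contracts, manuals
    needs_audit_trail: bool = False    # compliance or auditability


def choose_memory_layer(request: Request) -> str:
    """Use retrieval for exact, citable facts; working memory otherwise."""
    if (request.needs_citation
            or request.needs_exact_recall
            or request.needs_audit_trail):
        return "rag"
    return "working_memory"


print(choose_memory_layer(Request()))                     # behavioral state
print(choose_memory_layer(Request(needs_citation=True)))  # citable lookup
```

In a real stack the two layers would likely be used together on the same request rather than as an either-or choice; the rule above only illustrates which requirements push work toward retrieval.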

“Looking forward, I don’t think vector databases will become obsolete,” Lei said. “Instead, I expect enterprise AI stacks to become more layered. We’ll likely see short-term working memory within the model, longer-term open memory in search engines, and layers of policy or auditing that decide what is kept, retrieved, forgotten, or presented to the user.”
