
Long-horizon reasoning exposes a fundamental weakness in AI agents: context windows fill up quickly, and retrieval pipelines return noise instead of signal.
To solve this, researchers from the National University of Singapore have developed MRAgent, a framework that rejects the static "retrieve-then-reason" approach. Instead, it uses a mechanism that lets the agent dynamically evolve its memory based on the evidence it collects.
This multi-stage memory reconstruction is integrated into the reasoning process of the large language model (LLM). Although it is not the only framework in this space, MRAgent significantly reduces token consumption and runtime costs compared to other agent memory management approaches.
Limits of passive search in long-horizon tasks
In classical retrieval pipelines, documents are retrieved via vector search or graph traversal and fed to the LLM for reasoning. This passive approach breaks down because it does not integrate reasoning with memory access, creating three major bottlenecks:
- These systems cannot revise their retrieval strategy on the fly. If an agent retrieves a document and discovers an important missing clue, such as a specific date or person, it has no way to issue a new query based on that finding.
- Fixed similarity scores and predefined graph expansions return surface-level matches that fill the LLM's context window with unnecessary noise, degrading reasoning.
- Existing systems rely heavily on pre-built structures such as top-k results and static matching functions, which limits the flexibility required to scale across unpredictable, long-horizon user interactions.
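To make the second bottleneck concrete, here is a minimal sketch of a passive, single-shot top-k retriever. It is an illustration only: a toy bag-of-words cosine similarity stands in for a real embedding model, and the sample memories, the query, and the names `cosine` and `passive_retrieve` are invented for this example.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def passive_retrieve(query: str, docs: list[str], k: int) -> list[str]:
    """Score every document once, return the top k, and never revisit the query."""
    q = Counter(query.lower().split())
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(d.lower().split())), reverse=True)
    return ranked[:k]


# Invented sample memories, for illustration only.
docs = [
    "nate entered a video game tournament",
    "nate won a video game tournament",
    "nate put the prize money into savings",
]
query = "what did nate do after the video game tournament"

top = passive_retrieve(query, docs, k=2)
print(top)
```

In this toy setup, the two memories that share the most words with the query fill both top-k slots, while the memory that actually describes what happened afterwards is left out. Because the retriever runs once, nothing in the pipeline can follow up on what the first two results reveal.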
To overcome these limitations, the researchers argue, developers must adopt an “active and associative restructuring process,” a concept inspired by cognitive neuroscience.
Under this paradigm, memory recall unfolds sequentially rather than operating as a passive read of a static database. The system starts with small, specific cues from a user request, such as a person’s name, an action, or a location. These initial cues point to associated concepts or categories rather than to large blocks of text.
By following these metadata links, the agent collects small pieces of evidence one by one. It uses each new piece of information to guide its next step until it has pieced together a complete, accurate account.
How MRAgent performs active memory reconstruction
Instead of treating memory as a static database, MRAgent (Memory Reasoning Architecture for LLM Agents) treats it as an interactive environment. When processing a complex query, the agent uses the reasoning capabilities of the underlying LLM to explore multiple candidate search paths across a structured memory graph.
At each step, the LLM evaluates the intermediate evidence it has collected and uses it to refine its search. It extracts new search constraints, follows the most informative paths, and prunes irrelevant branches. This allows MRAgent to surface deeply buried information without flooding the LLM's context with noise.
To make this active exploration computationally efficient and scalable, the framework organizes the database using the “Cue-Tag-Content” mechanism. It works as a multilayer associative graph with three node types:
- Cues: Fine-grained keywords, such as entities or contextual attributes, extracted from user interactions.
- Content: The actual storage units. These are divided into multi-granular layers, such as episodic memory for specific events and semantic memory for stable facts and user preferences.
- Tags: Semantic bridges that summarize the relational associations between specific Cues and Content.
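The three node types can be sketched as a small in-memory data structure. This is a hedged illustration, not MRAgent's actual schema: the class names `Content` and `MemoryGraph`, their fields, and the sample entries are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Content:
    """A storage unit; `layer` is either "episodic" or "semantic"."""
    content_id: str
    layer: str
    text: str


@dataclass
class MemoryGraph:
    """Hypothetical three-layer graph: cue -> tags -> content ids."""
    cue_to_tags: dict[str, set[str]] = field(default_factory=lambda: defaultdict(set))
    tag_to_content: dict[str, list[str]] = field(default_factory=lambda: defaultdict(list))
    contents: dict[str, Content] = field(default_factory=dict)

    def add(self, cues: list[str], tag: str, content: Content) -> None:
        """Store one content unit and link it to its cues through a tag."""
        self.contents[content.content_id] = content
        self.tag_to_content[tag].append(content.content_id)
        for cue in cues:
            self.cue_to_tags[cue].add(tag)

    def tags_for(self, cue: str) -> set[str]:
        """Return the tags reachable from a cue without touching any content."""
        return set(self.cue_to_tags.get(cue, ()))


# Invented sample entries, for illustration only.
graph = MemoryGraph()
graph.add(
    ["Nate", "video game tournament"],
    "Tournament Victory",
    Content("ep1", "episodic", "Nate won a video game tournament."),
)
graph.add(
    ["Nate", "video game tournament"],
    "Participation in the tournament",
    Content("ep2", "episodic", "Nate signed up for a video game tournament."),
)
print(sorted(graph.tags_for("Nate")))
```

Keeping tags as short strings between cues and content means a relevance check can run over the tags alone, which is the property the search process relies on.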
This structure allows for a highly efficient two-step search process. The LLM first moves from Cues to candidate Tags. Because Tags explicitly expose the semantic relationships and structural associations in the data, the agent can evaluate these summaries to judge their relevance. The LLM identifies promising paths and discards irrelevant branches before spending compute and tokens to access detailed, token-heavy Content.
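A minimal sketch of that two-step search follows, assuming plain dictionaries for the graph and a keyword check (`keyword_judge`) standing in for the LLM's relevance judgment. The data and all function names are hypothetical.

```python
from typing import Callable

# Invented sample graph, for illustration only.
CUE_TO_TAGS = {
    "Nate": ["Tournament Victory", "Participation in the tournament"],
    "video game tournament": ["Tournament Victory", "Participation in the tournament"],
}
TAG_TO_CONTENT = {
    "Tournament Victory": ["win #1 episode", "win #2 episode", "win #3 episode"],
    "Participation in the tournament": ["sign-up episode", "practice episode"],
}


def two_step_search(
    cues: list[str], tag_is_relevant: Callable[[str], bool]
) -> tuple[list[str], list[str]]:
    # Step 1: cues -> candidate tags. Only short tag strings are inspected here.
    candidates = sorted({tag for cue in cues for tag in CUE_TO_TAGS.get(cue, [])})
    kept = [tag for tag in candidates if tag_is_relevant(tag)]
    # Step 2: load the heavier content only behind the tags that survived.
    loaded = [item for tag in kept for item in TAG_TO_CONTENT[tag]]
    return kept, loaded


def keyword_judge(tag: str) -> bool:
    """Stand-in for an LLM relevance call (an assumption of this sketch)."""
    return "victory" in tag.lower()


kept, loaded = two_step_search(["Nate", "video game tournament"], keyword_judge)
total = sum(len(items) for items in TAG_TO_CONTENT.values())
print(kept, f"loaded {len(loaded)} of {total} content items")
```

The saving comes from the order of operations: the pruning decision is made on tag summaries, so content behind discarded tags is never read into the context window.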
For example, a user might ask an AI agent: "How did Nate use his prize money when he won his third video game tournament?"
- MRAgent first extracts fine-grained starting cues from the query, e.g., "Nate," "video game tournament," and "the winner."
- The agent maps these initial cues to the memory graph and looks at the associative Tags attached to them. It sees tags such as "Tournament Victory" and "Participation in the tournament." Since the query only concerns what Nate did after winning, MRAgent discards the participation tag and follows the victory tag.
- The agent retrieves the episodic content associated with the selected Cue-Tag pair, which returns three memory episodes in which Nate won a tournament.
- MRAgent examines the three memories, decides that one matches the query, and rejects the other two.
- With this information, it updates its cues and begins another round of exploration and pruning. From the new episodic memory, the agent adds “tournament winnings” to its cues and uses it to traverse new tags and retrieve new memories. It repeats this process until it has gathered enough information to answer the query, with an answer such as “Nate saved money.”
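The walkthrough above can be sketched as a loop that alternates between pruning tags, filtering episodes, and expanding the cue set. This is a simplified illustration under stated assumptions, not the authors' implementation: the graph contents are invented, and the four small helper functions are keyword rules standing in for LLM judgments.

```python
# Invented sample graph loosely following the Nate example.
CUE_TO_TAGS = {
    "Nate": ["Tournament Victory", "Participation in the tournament"],
    "video game tournament": ["Tournament Victory", "Participation in the tournament"],
    "tournament winnings": ["Use of winnings"],
}
TAG_TO_EPISODES = {
    "Tournament Victory": [
        "Nate won his first video game tournament.",
        "Nate won his second video game tournament.",
        "Nate won his third video game tournament and received tournament winnings.",
    ],
    "Participation in the tournament": ["Nate signed up for a video game tournament."],
    "Use of winnings": ["Nate saved the tournament winnings."],
}


# The four helpers below are keyword stand-ins for LLM calls (assumptions).
def tag_is_relevant(tag: str) -> bool:
    return "participation" not in tag.lower()


def episode_is_relevant(text: str) -> bool:
    return "third" in text or "saved" in text


def extract_cues(text: str) -> list[str]:
    return [cue for cue in CUE_TO_TAGS if cue.lower() in text.lower()]


def is_sufficient(evidence: list[str]) -> bool:
    return any("saved" in item for item in evidence)


def reconstruct(start_cues: list[str], max_rounds: int = 5) -> list[str]:
    frontier = list(start_cues)
    seen_cues = set(start_cues)
    seen_tags: set[str] = set()
    evidence: list[str] = []
    for _ in range(max_rounds):
        # Look only at tags not examined in an earlier round, then prune them.
        tags = {t for c in frontier for t in CUE_TO_TAGS.get(c, [])} - seen_tags
        seen_tags |= tags
        kept = sorted(t for t in tags if tag_is_relevant(t))
        # Load episodes behind surviving tags and keep the ones that match.
        episodes = [e for t in kept for e in TAG_TO_EPISODES[t] if episode_is_relevant(e)]
        evidence.extend(episodes)
        if is_sufficient(evidence):
            break  # stop as soon as the accumulated evidence answers the query
        # Derive new cues from the accepted episodes for the next round.
        new_cues = [c for e in episodes for c in extract_cues(e) if c not in seen_cues]
        frontier = list(dict.fromkeys(new_cues))
        if not frontier:
            break
        seen_cues.update(frontier)
    return evidence


evidence = reconstruct(["Nate", "video game tournament", "the winner"])
print(evidence)
```

In this toy run the first round keeps only the third-tournament episode, which contributes the new cue "tournament winnings"; the second round follows that cue to the episode that answers the question, and the loop stops.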
MRAgent performance on industry benchmarks
MRAgent joins a number of other frameworks focused on agent memory. Alternatives include A-MEM, a graph-based agent memory framework, and MemoryOS, a hierarchical memory framework. Other persistent memory frameworks include LangMem and Mem0.
The researchers tested MRAgent on the LoCoMo and LongMemEval industry benchmarks. These test the ability of agents to resolve long-horizon tasks and queries across dozens of sessions and hundreds of dialogue turns. The main models used were Gemini 2.5 Flash and Claude Sonnet 4.5. The system was compared against standard RAG, A-MEM, MemoryOS, LangMem and Mem0.
MRAgent consistently outperformed each baseline by a significant margin across both models and all question types.
However, the most critical metric for enterprise developers is often compute cost. In the LongMemEval tests, MRAgent reduced token consumption to just 118k per instance. In comparison, A-MEM consumed 632k tokens and LangMem burned 3.26 million tokens per request. MRAgent also nearly halved runtime compared to A-MEM, from 1122 seconds to 586 seconds.
What makes MRAgent effective in practice is its on-demand behavior. Evaluating tags before retrieving content and pruning irrelevant paths saves money and context space. In addition, the system autonomously evaluates its accumulated context and knows when to stop searching, avoiding the exploration of unnecessary data.
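To put those figures in proportion, the short calculation below derives the relative costs from the numbers reported above. Nothing here is new data; the variable names are only for this example.

```python
# Figures as reported in the article for LongMemEval.
tokens = {"MRAgent": 118_000, "A-MEM": 632_000, "LangMem": 3_260_000}
runtime_s = {"MRAgent": 586, "A-MEM": 1122}

# How many times more tokens each baseline used than MRAgent.
amem_ratio = tokens["A-MEM"] / tokens["MRAgent"]
langmem_ratio = tokens["LangMem"] / tokens["MRAgent"]

# Percentage reduction in runtime relative to A-MEM.
runtime_cut_pct = (1 - runtime_s["MRAgent"] / runtime_s["A-MEM"]) * 100

print(f"A-MEM used {amem_ratio:.1f}x the tokens of MRAgent")
print(f"LangMem used {langmem_ratio:.1f}x the tokens of MRAgent")
print(f"Runtime vs. A-MEM fell by {runtime_cut_pct:.1f}%")
```

On these numbers, A-MEM used roughly 5.4 times as many tokens as MRAgent and LangMem roughly 27.6 times as many, while runtime relative to A-MEM dropped by about 47.8 percent.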
Implementation and development
Although MRAgent is highly efficient, the Cue-Tag-Content structure must be prepared before the agent can query it. Developers need to learn how to architect the underlying memory database so the LLM can efficiently traverse associative links and prune irrelevant paths without exploding computational overhead.
Fortunately, developers don’t have to manually label or structure this data. The authors designed MRAgent with an automated distillation pipeline that uses LLMs to process raw interaction histories and automatically populate the memory graph. The job for the developer is to implement and orchestrate this automated ingest pipeline rather than tagging the data manually.
To extract this metadata before storing it in your graph database, you need to set up a background job or streaming pipeline that passes user interactions through prompt templates.
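A background ingestion job of this kind might look like the sketch below. It is an assumption-laden outline rather than MRAgent's released pipeline: the prompt wording, the JSON reply shape, the `ingest` function, and the dictionary-based graph are all hypothetical, and `fake_llm` is a rule-based stub used in place of a real model call so the example runs offline.

```python
import json
from typing import Callable

# Hypothetical prompt template; the article does not show MRAgent's actual prompts.
EXTRACTION_PROMPT = (
    "Extract memory metadata from the interaction below.\n"
    'Reply with JSON: {{"cues": [...], "tag": "...", "content": "..."}}\n\n'
    "Interaction:\n{turn}"
)


def ingest(turns: list[str], call_llm: Callable[[str], str], graph: dict) -> dict:
    """Pass each raw interaction through the template and link cue -> tag -> content."""
    for turn in turns:
        record = json.loads(call_llm(EXTRACTION_PROMPT.format(turn=turn)))
        tag = record["tag"]
        graph["tag_to_content"].setdefault(tag, []).append(record["content"])
        for cue in record["cues"]:
            graph["cue_to_tags"].setdefault(cue, set()).add(tag)
    return graph


def fake_llm(prompt: str) -> str:
    """Rule-based stand-in for a model call (an assumption of this sketch)."""
    turn = prompt.rsplit("Interaction:\n", 1)[1]
    cues = [word.strip(".,!?") for word in turn.split() if word[0].isupper()]
    tag = "Tournament Victory" if "won" in turn else "Participation in the tournament"
    return json.dumps({"cues": cues, "tag": tag, "content": turn})


# Invented interaction history, for illustration only.
turns = [
    "Nate won his third video game tournament.",
    "Nate signed up for a video game tournament.",
]
graph = ingest(turns, fake_llm, {"cue_to_tags": {}, "tag_to_content": {}})
print(sorted(graph["cue_to_tags"]["Nate"]))
```

Swapping `fake_llm` for a real model client and the dictionaries for a graph database would keep the same shape: one extraction call per interaction, followed by writes that link cues to content through tags.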
However, the authors emphasize that this is a lightweight build phase and that MRAgent intentionally keeps adoption simple.
The authors have released the code on GitHub.





