A new framework lets AI agents rewrite their own skills without retraining the underlying model



One of the main challenges in implementing autonomous agents is building systems that can adapt to changes in their environment without the need to retrain the underlying large language models (LLMs).

Memento-Skills, a new framework developed by researchers at multiple universities, addresses this bottleneck by allowing agents to develop their own skills. "It adds continuous learning capabilities to current market offerings such as OpenClaw and Claude Code," Jun Wang, co-author of the paper, told VentureBeat.

Memento-Skills acts as an evolving external memory, allowing the system to incrementally improve its capabilities without changing the base model. The framework maintains a set of skills that can be updated and extended as the agent receives feedback from its environment.

This is important for enterprise teams that deploy agents in production. The alternative, fine-tuning model weights or hand-building skills, carries significant operational costs and data requirements. Memento-Skills sidesteps both.

Challenges of building self-developing agents

Self-evolving agents are critical because they overcome the limitations of frozen language models. Once a model is deployed, its parameters remain fixed, limiting it to the knowledge encoded during training and whatever fits within its immediate context window.

Giving the model an external memory scaffold allows it to be upgraded without the expensive and slow retraining process. However, existing approaches to agent adaptation largely depend on hand-crafted skills to handle new tasks. Although some automatic skill-learning methods exist, they mainly produce text-only instructions, which confines optimization to the prompt level. Other approaches simply record single-task trajectories that do not transfer between different tasks.

Furthermore, when these agents try to retrieve relevant knowledge for a new task, they typically rely on semantic-similarity routers such as standard dense embeddings; high semantic overlap does not guarantee behavioral utility. An agent relying on standard RAG can retrieve a "password reset" script for a "refund processing" query simply because the documents share enterprise terminology.

"Most retrieval-augmented generation (RAG) systems are based on similarity search. However, when skills are represented as executable artifacts such as logs or code snippets, similarity alone may not select the most effective skill," Wang said.
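The gap between semantic similarity and behavioral utility can be shown with a toy router. This sketch is illustrative only: the skill names, scores, and the blended scoring rule are assumptions, not the paper's actual router.

```python
# Toy illustration (not the paper's router): re-ranking candidate skills
# by observed success rate instead of raw similarity alone.

skills = {
    "password_reset": {"similarity": 0.91, "successes": 1, "attempts": 10},
    "refund_processing": {"similarity": 0.84, "successes": 8, "attempts": 9},
}

def by_similarity(cands):
    # pick the skill whose text looks most like the query
    return max(cands, key=lambda k: cands[k]["similarity"])

def by_utility(cands):
    # blend similarity with empirical success rate (weights are arbitrary)
    def score(k):
        s = cands[k]
        return 0.3 * s["similarity"] + 0.7 * (s["successes"] / s["attempts"])
    return max(cands, key=score)

print(by_similarity(skills))  # → password_reset (lexically closest)
print(by_utility(skills))     # → refund_processing (actually works)
```

A similarity-only router picks the skill that merely shares vocabulary with the query; weighting in execution feedback flips the choice to the skill that has actually solved such tasks.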

How Memento-Skills stores and updates skills

To address the limitations of existing agent systems, the researchers built Memento-Skills. The paper describes the system as "a generic, continuously learnable LLM agent system that acts as an agent-designing agent." Instead of keeping a passive record of past conversations, Memento-Skills builds a library of skills that acts as a continuously evolving external memory.

These skills are stored as structured files and serve as the agent's evolving knowledge base. Each reusable skill artifact consists of three main elements: a declarative specification of what the skill is and how it should be used; concrete instructions and guidelines that steer the language model's reasoning; and the executable code and helper scripts the agent runs to actually solve the task.
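The three-part artifact described above might be represented roughly like this. The field names and example entry are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass

@dataclass
class SkillArtifact:
    """Sketch of a reusable skill record (field names are illustrative)."""
    name: str
    specification: str  # declarative: what the skill is, when to use it
    guidelines: str     # instructions that steer the LLM's reasoning
    code: str           # executable code / helper scripts the agent runs

# Hypothetical entry the agent might store in its skill library
skill = SkillArtifact(
    name="csv_summarize",
    specification="Summarize a CSV file: column names, row count, basic stats.",
    guidelines="Prefer pandas when available; fall back to the csv module.",
    code="import csv\n# ...helper script body...",
)
print(skill.name)  # → csv_summarize
```

Keeping the spec, the reasoning guidelines, and the runnable code in one record is what lets later updates patch any of the three independently.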

Memento-Skills achieves continuous learning through a reflective learning mechanism that frames memory updates as active policy iteration rather than passive data logging. When faced with a new task, the agent queries a dedicated skill router to retrieve the most behaviorally relevant skill, rather than the most semantically similar one.

After the agent executes the skill and receives feedback, the system reflects on the result to close the learning loop. Instead of appending a log of what happened, it actively mutates the skill memory. If the execution fails, the orchestrator evaluates the trace and rewrites the skill artifacts: it directly updates the code to patch the specific failure mode, or creates an entirely new skill when needed.
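The execute-reflect-mutate loop above can be sketched as follows. Everything here, from the tag-matching "router" to how a new skill is synthesized, is a simplified assumption for illustration, not the paper's implementation.

```python
# Minimal runnable sketch of the reflect-and-mutate loop (illustrative only).

def run_skill(skill, task):
    """Pretend-execute: a skill 'works' only if its tag appears in the task."""
    ok = skill["tag"] in task
    return {"success": ok, "trace": f"ran {skill['name']} on {task!r}"}

def reflective_step(library, task):
    # naive router stand-in: first skill whose tag matches, else any skill
    skill = next((s for s in library.values() if s["tag"] in task),
                 next(iter(library.values())))
    result = run_skill(skill, task)
    if not result["success"]:
        # mutate memory: synthesize a new skill for the unseen task type
        tag = task.split()[0]
        library[f"{tag}_skill"] = {"name": f"{tag}_skill", "tag": tag}
    return result

library = {"browse_skill": {"name": "browse_skill", "tag": "browse"}}
reflective_step(library, "summarize report.pdf")  # fails, so the library grows
print(sorted(library))  # → ['browse_skill', 'summarize_skill']
```

The key point the sketch captures is that a failure changes the memory itself, rather than just being appended to a log.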

Memento-Skills also updates the skill router through a one-step offline reinforcement learning process that learns from performance feedback rather than text overlap. "The true value of a skill lies in how it contributes to the overall agent workflow and downstream performance," Wang said. "Reinforcement learning therefore provides a more appropriate framework, as it allows the agent to evaluate and select skills based on long-term utility."
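A reward-driven router update can be sketched as a simple bandit-style rule over logged episodes. The update formula and learning rate here are my assumptions to illustrate the idea of learning from downstream reward offline; the paper's exact objective may differ.

```python
# Bandit-style sketch: adjust router scores from logged downstream rewards,
# with no further environment interaction (hence "offline").

router_scores = {"password_reset": 0.5, "refund_processing": 0.5}
LR = 0.2  # learning rate (arbitrary)

def offline_update(logged_episodes):
    """One pass over logged (skill, reward) pairs."""
    for skill, reward in logged_episodes:
        # move the score toward the observed downstream reward
        router_scores[skill] += LR * (reward - router_scores[skill])

offline_update([
    ("password_reset", 0.0),     # retrieved by similarity, but the task failed
    ("refund_processing", 1.0),  # actually solved the task
])
print(router_scores)  # refund_processing now outranks password_reset
```

After one pass, the skill that produced reward ranks above the one that merely looked relevant, which is exactly the behavior-over-similarity property the quote describes.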

Automated skill mutations are protected by an automatic unit-test gate to prevent regressions in production. The system creates synthetic test cases, runs them against the updated skill, and validates the results before committing the changes to the global library.
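A gate like that could look roughly like this. In the paper the test cases are generated synthetically; here they are hard-coded, and the skill bodies are trivial placeholders.

```python
# Sketch of a unit-test gate guarding a skill mutation (illustrative only).

def current_skill(x):
    return x * 2

def updated_skill(x):
    return x + x  # candidate rewrite to validate before committing

def gate(candidate, test_cases):
    """Commit the mutation only if every synthetic case still passes."""
    return all(candidate(inp) == expected for inp, expected in test_cases)

synthetic_cases = [(2, 4), (0, 0), (-3, -6)]
if gate(updated_skill, synthetic_cases):
    committed = updated_skill   # save to the global library
else:
    committed = current_skill   # reject the regression
print(committed(5))  # → 10
```

The design choice matters for production: a mutation that fails any case is simply discarded, so a bad self-rewrite can never replace a working skill.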

By constantly rewriting and improving its executable skills, Memento-Skills lets the frozen language model build a kind of muscle memory and gradually expand its capabilities.

Testing a self-improving agent

The researchers evaluated Memento-Skills on two rigorous benchmarks. The first, General AI Assistants (GAIA), requires complex multistep reasoning, multimodality, web browsing, and tool use. The second, Humanity's Last Exam (HLE), is an expert-level benchmark covering eight academic subjects, such as mathematics and biology. The entire system was powered by Gemini-3.1-Flash acting as the frozen base language model.

The system was compared against a read-write baseline that acquires skills and collects feedback but lacks the self-evolution features. The researchers also tested their custom skill router against standard semantic retrieval baselines, including BM25 and Qwen3 embeddings.

The results showed that an actively self-evolving memory far outperforms a static skill library. On the highly diverse GAIA benchmark, Memento-Skills improved test-set accuracy by 13.7 percentage points over the static baseline, reaching 66.0% versus 52.3%. On the HLE benchmark, where the domain structure allows extensive cross-task skill reuse, the system more than doubled its baseline, going from 17.9% to 38.7%.

Moreover, Memento-Skills' specialized skill router avoids the classic retrieval trap of selecting an inappropriate skill purely on semantic similarity. Experiments show that Memento-Skills raises the task success rate to 80%, compared to just 50% for standard BM25 search.

The researchers observed that Memento-Skills drives this performance through organic, structured skill growth. Both benchmarks started with just five atomic seed skills, such as basic web browsing and terminal operations. On the GAIA benchmark, the agent grew this seed set into a compact library of 41 skills to handle diverse tasks. On the expert-level HLE benchmark, the system dynamically scaled its library to 235 distinct skills.

Finding the enterprise sweet spot

The researchers have released the code for Memento-Skills on GitHub, ready for use.

For enterprise architects, the effectiveness of this system depends on domain alignment. Rather than simply looking at benchmark scores, the key business trade-off hinges on whether your agents handle isolated tasks or structured workflows.

"Transferability of skills depends on the degree of similarity between tasks," Wang said. "First, when tasks are isolated or weakly related, the agent cannot rely on previous experience and must learn through interaction." In such distributed environments, transfer between tasks is limited. "Second, when tasks share significant structure, previously acquired skills can be directly reused. Here, learning becomes more efficient because knowledge is transferred between tasks, allowing the agent to perform well on new problems with little or no additional interaction."

Given that the system requires repetitive task patterns to reinforce knowledge, business leaders need to know exactly where to deploy it today.

"Workflows are probably the most appropriate setting for this approach, as they provide a structured environment in which skills can be designed, evaluated and improved," Wang said.

However, he cautioned against over-deployment in areas not yet suitable for the framework. "Physical agents remain largely unexplored in this context and require further investigation. In addition, tasks with longer horizons may require more advanced approaches, such as multi-agent LLM systems, to ensure coordination, planning, and continuous execution over extended decision sequences."

As the industry moves toward agents that autonomously rewrite their production code, governance and security remain key. While Memento-Skills uses basic security rails such as automated unit test gates, enterprise adoption will likely require a broader framework.

"To enable reliable self-improvement, we need a well-designed evaluation or judging system that can assess performance and provide consistent guidance," Wang said. "Rather than allowing unfettered self-modification, the process should be framed as a controlled form of self-development in which feedback guides the agent toward better designs."


