How to build custom reasoning models with a fraction of the compute



Training AI reasoning models requires resources that most enterprise teams do not have. Engineering teams are often forced to choose between distilling knowledge from large, expensive models or relying on reinforcement learning techniques that provide sparse feedback.

JD.com and researchers at several academic institutions recently introduced a new training paradigm that sidesteps this dilemma. The technique, called Reinforcement Learning with Verifiable Rewards through Self-Distillation (RLSD), combines the reliable reward signal of reinforcement learning with the granular feedback of self-distillation.

Experiments show that models trained with RLSD outperform those trained with classic distillation and reinforcement learning algorithms. For enterprise teams, this approach lowers the technical and financial barriers to building custom reasoning models tailored to specific business logic.

The problem with teaching models to reason

Reinforcement Learning with Verifiable Rewards (RLVR) is the standard way to teach models to reason. In this paradigm, the model learns through trial and error, guided by the final outcome from its environment. An automated verifier checks whether the model’s answer is correct or incorrect, providing a binary reward of 0 or 1.

RLVR suffers from sparse and uniform feedback. “Standard GRPO has a signal density problem,” Chenxu Yang, co-author of the paper, told VentureBeat. “A multi-step reasoning trace gets a single binary reward, and every token within that trace gets the same credit, whether it’s a pivotal logic step or a filler phrase.” Consequently, the model never learns which intermediate steps led to its success or failure.
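To make that concrete, a verifiable reward is typically nothing more than a programmatic check on the final answer, with the resulting scalar broadcast uniformly over every token of the trace. The sketch below is purely illustrative rather than code from the paper, and the “Answer:” output convention is an assumption:

```python
import re

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Binary RLVR-style reward: 1.0 if the final answer matches, else 0.0."""
    # Hypothetical convention: the trace ends with a line like "Answer: <value>".
    match = re.search(r"Answer:\s*(.+)", model_output)
    predicted = match.group(1).strip() if match else ""
    return 1.0 if predicted == ground_truth.strip() else 0.0

def token_credits(reward: float, num_tokens: int) -> list[float]:
    """In plain RLVR/GRPO, every token in the trace inherits the same scalar
    reward, whether it was a pivotal logic step or filler."""
    return [reward] * num_tokens
```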

On-Policy Distillation (OPD) takes a different approach. Instead of waiting for the final result, developers pair a smaller student model with a larger, more capable teacher model. For each training example, the student compares its output against the teacher’s, token by token. This gives the student detailed feedback across the entire reasoning chain and response generation process.

Deploying and running a separate, massive teacher model alongside the student throughout training incurs huge computational costs. “You have to keep a larger teacher model in memory throughout training, which roughly doubles your GPU footprint,” Yang said. Furthermore, teacher and student models must share the same vocabulary, which, according to Yang, “quietly rules out the mixed-architecture, cross-modality, or multilingual setups that organizations actually run.”
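For a picture of what OPD’s dense feedback looks like mechanically, here is a minimal PyTorch-style sketch (not from the paper) of a token-level distillation loss. It is only well defined when student and teacher produce logits over the same vocabulary, which is exactly the constraint Yang points to:

```python
import torch
import torch.nn.functional as F

def opd_distillation_loss(student_logits: torch.Tensor,
                          teacher_logits: torch.Tensor) -> torch.Tensor:
    """Token-level distillation loss over the student's own rollout.

    Both tensors are [batch, seq_len, vocab_size] and must share a vocabulary;
    the separate teacher model producing `teacher_logits` also has to stay
    resident in GPU memory for the entire run.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # KL(teacher || student), giving the student dense, per-token feedback.
    return F.kl_div(student_logp, teacher_logp, log_target=True,
                    reduction="batchmean")
```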

The promise and failure of self-distillation

On-Policy Self-Distillation (OPSD) emerged as a solution designed to overcome the shortcomings of the other two approaches. In OPSD, the same model plays the role of both student and teacher.

During training, the student receives the standard prompt, while the teacher receives privileged information such as a verified, step-by-step answer key. This better-informed teacher version of the model then evaluates the student’s attempt to solve the problem using only the standard prompt, providing token-by-token feedback.

OPSD seems like the perfect compromise for an enterprise budget. It delivers the detailed, token-by-token guidance of OPD. And because it eliminates the need for an external teacher model, it retains the computational efficiency and low cost of RLVR, requiring only one additional forward pass for the teacher.
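A rough sketch of the OPSD mechanics, assuming a Hugging Face-style model and tokenizer; the prompt templates and helper names here are illustrative, not the paper’s:

```python
import torch

def score_trace(model, tokenizer, context: str, trace: str) -> torch.Tensor:
    """Return the model's logits for each token of `trace`, conditioned on `context`."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    trace_ids = tokenizer(trace, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([ctx_ids, trace_ids], dim=1)
    logits = model(input_ids).logits
    # Positions whose next-token predictions correspond to the trace tokens.
    start = ctx_ids.shape[1] - 1
    return logits[:, start:start + trace_ids.shape[1], :]

def opsd_passes(model, tokenizer, question: str, answer_key: str, student_trace: str):
    """OPSD sketch: one model plays both roles. The 'student' context holds only
    the question; the 'teacher' context also sees the verified solution."""
    student_ctx = f"Question: {question}\n"
    teacher_ctx = f"Question: {question}\nVerified solution: {answer_key}\n"
    student_logits = score_trace(model, tokenizer, student_ctx, student_trace)
    with torch.no_grad():  # the teacher pass is the only extra compute
        teacher_logits = score_trace(model, tokenizer, teacher_ctx, student_trace)
    return student_logits, teacher_logits
```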

However, the researchers found that OPSD suffers from a phenomenon called “privileged information leakage.”

“The objective is structurally ill-posed,” Yang said. “There is an irreducible mutual information gap that the student can never bridge. . . . When self-distillation is framed as distribution matching, the student is asked to imitate the teacher’s full output distribution under a privileged context.”

Because the teacher evaluates the student against a hidden answer key, the learning objective pushes the student model to mimic the teacher’s exact wording and steps instead of the underlying logic. As a result, the student model begins to hallucinate references to an unseen solution that it will never have access to in real-world deployment.

In practice, OPSD models show a rapid increase in performance at the start of training, but the gains are superficial and their reasoning abilities gradually deteriorate over time.

Separation of direction from magnitude with RLSD

The researchers behind RLSD realized that the signals that govern how a model updates its parameters have fundamentally asymmetric requirements. They found that the signal dictating the direction of the update (i.e., reinforcing or punishing the behavior) can be sparse, but must be completely reliable, because biasing the model in the wrong direction undermines its reasoning policy.

On the other hand, the signal that dictates the magnitude of the update (i.e., how much relative credit or blame a particular step deserves) benefits from being extremely dense, enabling fine-grained, step-by-step adjustments.

RLSD builds on this principle by separating the update direction from the update magnitude. The framework lets the verifiable environmental feedback of the RLVR signal rigorously determine the direction of learning: the model receives positive reinforcement only when the final answer is objectively correct.

Crucially, the teacher no longer has the authority to dictate what the model should generate. Instead, the teacher’s token-by-token evaluation is repurposed to determine the magnitude of the update: it simply distributes the overall credit or blame across the individual steps of the model’s reasoning path.

This changes how the model learns compared to the classical OPSD paradigm. In standard OPSD, the training objective acts as behavioral cloning, forcing the model to copy the teacher’s exact wording and expression directly. This is what causes the student to hallucinate and leak references to information it doesn’t have.

Instead of forcing the model to copy a hidden solution, RLSD uses the teacher’s evaluation as a natural, virtually free source of per-token credit information.

“The intuition: we don’t teach the model to think like the teacher,” Yang said. “We tell the model which of its tokens are actually doing the work on its chosen path. The model’s output distribution remains its own. Only the credit assignment is sharpened.”

A particular deduction receives a higher weight if it strongly supports the correct conclusion; if it’s just useless filler, it receives little credit. RLSD eliminates the need to train complex auxiliary reward models, manually annotate step-by-step data, or maintain massive external teacher models.
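Based only on the description above, the resulting credit assignment might look something like the following sketch; the softmax-style reweighting and the function names are assumptions rather than the paper’s exact formulation:

```python
import torch

def rlsd_token_advantages(outcome_advantage: float,
                          teacher_token_logprobs: torch.Tensor) -> torch.Tensor:
    """Sketch of RLSD-style credit assignment for one reasoning trace.

    Direction: the sign comes only from the verifiable outcome signal
    (e.g., GRPO's group-normalized reward for the whole trace).
    Magnitude: the teacher's token-by-token scores are renormalized into
    non-negative weights (mean 1.0) that spread that credit across tokens.
    """
    weights = torch.softmax(teacher_token_logprobs, dim=-1)   # non-negative, sums to 1
    weights = weights * teacher_token_logprobs.numel()        # rescale to mean 1.0
    return outcome_advantage * weights                        # sign set by the verifier only
```

In this scheme, tokens the teacher deems pivotal end up weighted above average and filler tokens below it, but no token is ever pushed in a direction the verifier did not endorse.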

Testing RLSD

To test RLSD, the researchers trained the open-weight Qwen3-VL-8B vision-language model and evaluated it on several visual reasoning benchmarks. These include MMMU for college-level multidisciplinary questions; MathVista, MathVision and WeMath for mathematical reasoning; and ZeroBench, a stress-test benchmark expressly designed to be nearly impossible for current frontier models.

They compared the RLSD-trained model against the untrained base model, standard RLVR via the GRPO algorithm, standard OPSD, and a hybrid combination of the two.

RLSD significantly outperformed all other methods, achieving the highest average accuracy of 56.18% across the five benchmarks. It outperformed the base model by 4.69% and beat standard RLVR by 2.32%. The gains were most pronounced on complex mathematical reasoning tasks, where RLSD outperformed standard RLVR by 3.91% on the MathVision benchmark.

In addition to accuracy, the framework offers large efficiency gains. “Specifically, at 200 training steps, RLSD already outperforms GRPO trained for 400 steps, so about 2x convergence speed,” Yang said. “In terms of cost, the only overhead outside of the normal GRPO pipeline is one extra forward pass per response to capture teacher logits. Compared to rollout generation… it’s basically free.”

Unlike OPSD, which saw a performance spike and then collapsed due to privileged-information leakage, RLSD maintained long-term training stability and converged on a higher performance ceiling than the standard methods.

Qualitative results highlight how the model’s learning behavior changes. For example, in a complex visual computation task, standard RLVR sees only the final correct answer and assigns the same reward to every token in the reasoning trace. RLSD instead surgically rewards the specific mathematical subtraction steps that solve the problem, while actively down-weighting filler text such as “Looking at the picture, I see…”

In another example, the model produced an incorrect mathematical derivation from a bar chart. Instead of marking the entire response as a failure, RLSD concentrated the heaviest penalty on the point where the model misread the relationship in the chart, while remaining largely neutral on the rest of the setup, recognizing that the initial framing was valid.

This is especially important for messy, real-world enterprise use cases. If the model makes a mistake analyzing a 50-page quarterly earnings report, developers don’t want it to unlearn its entire analysis framework; they just want to correct the particular assumption that was wrong. RLSD lets the model learn exactly which reasoning steps are valid and which are flawed. And because RLSD does this with the model acting as its own teacher, it delivers granular reasoning capability while keeping training costs reasonable.

How enterprises can get started

For data engineers and AI leads, integrating RLSD is straightforward, but it requires the right setup. The most critical requirement is a reward signal that can be checked programmatically, by code compilers, math verifiers, SQL execution or schema validators. “Tasks without a verifiable reward (open-ended dialogue, brand-voice writing) belong in preference-based pipelines,” Yang said.
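Such a verifiable reward can be as simple as executing the model’s output and comparing results. The SQLite-based sketch below is an illustrative example, not something from the paper; the database path and function names are assumptions:

```python
import sqlite3

def sql_execution_reward(generated_sql: str, reference_sql: str, db_path: str) -> float:
    """Binary verifiable reward: 1.0 if the generated query returns the same rows
    as a trusted reference query run against the same database, else 0.0."""
    conn = sqlite3.connect(db_path)
    try:
        predicted = conn.execute(generated_sql).fetchall()
        expected = conn.execute(reference_sql).fetchall()
        return 1.0 if sorted(map(repr, predicted)) == sorted(map(repr, expected)) else 0.0
    except sqlite3.Error:
        return 0.0  # queries that fail to execute earn no reward
    finally:
        conn.close()
```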

However, RLSD is quite flexible about the privileged information it requires. While OPSD structurally requires full intermediate reasoning traces, forcing enterprises to pay annotators or distill from a frontier model, RLSD does not.

“If you have fully verified reasoning traces, great, RLSD will use them,” Yang said. “If all you have is the final ground-truth answer, that works… OPSD doesn’t have that flexibility.”

Integrating the technique into existing open-source multimodal RL frameworks such as veRL or EasyR1 is lightweight. According to Yang, it requires no framework rewrite and drops directly into the default stack. The code change amounts to a few dozen lines to adjust the GRPO objective and synchronize the teacher’s forward pass with the student’s.
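In a GRPO-style pipeline, the shape of that change is roughly the following sketch, under the same assumptions as the earlier snippets and not the authors’ code: the uniform per-trace advantage is swapped for RLSD’s per-token advantages.

```python
import torch

def grpo_token_loss(student_logprobs: torch.Tensor,
                    token_advantages: torch.Tensor) -> torch.Tensor:
    """Policy-gradient loss with per-token advantages (sketch).

    In vanilla GRPO every token of a trace shares one advantage value; here
    `token_advantages` carries RLSD's reweighted, per-token credit
    (e.g., the output of rlsd_token_advantages above).
    """
    return -(student_logprobs * token_advantages.detach()).mean()
```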

Looking ahead, RLSD offers businesses a powerful way to maximize their existing internal assets.

“The private information that enterprises keep within their perimeters (compliance guidelines, internal documents, historical tickets, verified code snippets) is essentially free privileged information,” Yang said. “RLSD allows enterprises to feed such data directly into the privileged context, which sharpens the learning signal in smaller models without the need for an external teacher and without sending anything outside the network.”


