Companies looking to deploy AI reasoning models often face a harsh reality: training these systems demands computing power that most engineering teams simply don't have access to. Traditional approaches either rely on costly knowledge distillation from massive models or on reinforcement learning techniques that provide only sparse feedback, leaving businesses stuck choosing between performance and affordability.
Researchers from JD.com and several academic partners have developed a groundbreaking training method that addresses this challenge. Their approach, called Reinforcement Learning with Verifiable Rewards with Self-Distillation (RLSD), merges the reliability of reinforcement learning with the detailed feedback from self-distillation. Early experiments show that models trained with RLSD significantly outperform those built using conventional distillation or reinforcement learning techniques, while drastically reducing the computational resources required.
Why traditional training methods fall short
The most widely used method for training reasoning models, Reinforcement Learning with Verifiable Rewards (RLVR), operates on a simple premise: the model learns through trial and error, with an automated verifier assessing the final outcome as either correct or incorrect. While this provides a clear performance signal, the feedback is binary and uniformly distributed across every token in the reasoning chain.
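For intuition, here is a minimal Python sketch of that uniform credit assignment; it is an illustration of the idea rather than the authors' code, and the group-normalized advantage shown is one common GRPO-style formulation.

```python
import torch

def grpo_token_advantages(rewards: torch.Tensor, seq_lens: list[int]) -> list[torch.Tensor]:
    """rewards: 0/1 verifier outcomes for a group of sampled reasoning traces."""
    # Group-relative baseline: each trace's reward is normalized against the group.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    # Every token in a trace inherits the same scalar advantage,
    # whether it was a critical deduction or a filler phrase.
    return [torch.full((n,), adv.item()) for adv, n in zip(advantages, seq_lens)]

# Example: four sampled traces, two verified correct, each hundreds of tokens long.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
per_token = grpo_token_advantages(rewards, seq_lens=[1200, 950, 1430, 800])
print(per_token[0][:5])  # the same value repeated for all 1,200 tokens of trace 0
```

The single number the verifier returns is reliable, but it says nothing about which of those 1,200 tokens actually mattered.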
"Standard GRPO has a signal density problem," explains Chenxu Yang, co-author of the research paper. "A reasoning trace spanning thousands of tokens receives just a single binary reward. Every token in that sequence—whether it's a critical logical step or a filler phrase—gets the same credit or penalty. This makes it impossible for the model to identify which intermediate steps contributed to success or failure."
An alternative approach, On-Policy Distillation (OPD), attempts to solve this by pairing a smaller student model with a larger, more capable teacher model. During training, the student compares its responses to the teacher's output token by token, receiving granular feedback on its entire reasoning process. However, this method introduces substantial computational overhead.
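A rough sketch of what that dense signal looks like, assuming a standard per-token KL objective (the exact loss OPD uses may differ):

```python
import torch
import torch.nn.functional as F

def per_token_distill_loss(student_logits: torch.Tensor,
                           teacher_logits: torch.Tensor) -> torch.Tensor:
    """Both tensors: (seq_len, vocab) logits over the SAME vocabulary."""
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # Per-token KL(student || teacher): every position gets its own correction,
    # unlike the single trace-level reward in RLVR.
    kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)
    return kl  # shape (seq_len,): one feedback value per generated token
```

The catch is visible in the function signature: the teacher must stay resident on the GPUs for every training step, and the two models must share a vocabulary for the token-wise comparison to be defined at all.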
"You need to keep the larger teacher model active throughout training, which roughly doubles your GPU usage," Yang notes. "Additionally, both models must share the same vocabulary structure, which effectively excludes most cross-architecture, cross-modality, or multilingual setups that enterprises commonly require."
The pitfalls of self-distillation
On-Policy Self-Distillation (OPSD) emerged as a potential solution by eliminating the need for a separate teacher model. In this method, the same model acts as both student and teacher. During training, the student receives a standard prompt while the teacher version receives privileged information—such as a verified step-by-step solution. The teacher then evaluates the student's performance token by token, providing detailed feedback.
At first glance, OPSD seems like the ideal compromise for budget-conscious teams. It delivers the granularity of OPD while maintaining the computational efficiency of RLVR, requiring only one extra forward pass of the same model acting in its teacher role.
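A simplified illustration of that setup, with a hypothetical prompt format standing in for whatever the paper actually uses:

```python
def build_opsd_prompts(question: str, reference_solution: str) -> tuple[str, str]:
    """One model, two views: the teacher pass gets privileged information."""
    student_prompt = f"Question: {question}\nAnswer step by step."
    # The teacher sees a verified solution the student will never get at inference time.
    teacher_prompt = (
        f"Question: {question}\n"
        f"Reference solution: {reference_solution}\n"
        "Answer step by step."
    )
    return student_prompt, teacher_prompt

# A single extra forward pass with teacher_prompt scores the student's tokens,
# so no second set of model weights ever needs to be loaded.
```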
However, the researchers discovered a critical flaw: "privileged information leakage." Yang explains, "The objective is fundamentally ill-posed. There's an unbridgeable mutual-information gap that the student can never overcome." When the training objective forces the student to mimic the teacher's exact phrasing or steps—rather than the underlying reasoning logic—the model starts generating references to solutions it will never access in real-world scenarios. This leads to rapid initial performance gains followed by a sharp plateau and eventual decline in reasoning capabilities.
How RLSD transforms training efficiency
The RLSD framework addresses these limitations by decoupling the direction of model updates from their magnitude. The researchers identified that the signal guiding the update direction (whether to reinforce or penalize a behavior) must be sparse yet perfectly reliable, as incorrect direction can severely damage the model's reasoning policy. Conversely, the signal determining the magnitude of updates benefits from being dense, enabling precise, step-by-step corrections.
In RLSD, the verifiable environmental feedback from RLVR strictly controls the update direction. The model only receives overall reinforcement when its final answer is objectively correct. The teacher's token-by-token assessment is repurposed to determine the magnitude of updates, distributing credit or blame across individual steps in the reasoning path rather than dictating what the model should generate.
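Read that way, the per-token update weights can be sketched as follows. This is a simplified interpretation of the decoupling described above, not the authors' released implementation, and the normalization of the teacher scores is an assumption.

```python
import torch

def rlsd_token_weights(verifier_reward: float,
                       teacher_token_scores: torch.Tensor) -> torch.Tensor:
    """verifier_reward: +1 if the final answer passes the verifier, -1 otherwise.
    teacher_token_scores: (seq_len,) nonnegative per-token importance from the
    privileged teacher pass (e.g. how strongly it endorses each step)."""
    # Direction comes only from the verifiable outcome...
    direction = 1.0 if verifier_reward > 0 else -1.0
    # ...while the dense teacher signal decides how much each token moves.
    magnitude = teacher_token_scores / (teacher_token_scores.sum() + 1e-6)
    return direction * magnitude  # per-token credit that never flips the sign

weights = rlsd_token_weights(+1.0, torch.tensor([0.1, 0.9, 0.3, 0.05]))
print(weights)  # a key step (0.9) earns far more credit than filler (0.05)
```

Because the teacher can only scale updates, never reverse them, the privileged information cannot pull the student toward solutions it will never see at inference time.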
This innovation enables enterprises to build custom reasoning models with significantly reduced computational requirements. By eliminating the need for resource-intensive teacher models and addressing the flaws of self-distillation, RLSD opens new possibilities for deploying advanced AI solutions across industries without breaking the bank.
The next generation of AI reasoning models may soon be within reach for businesses of all sizes, thanks to techniques that prioritize both performance and practicality.
AI summary
The RLSD method, developed by JD.com and academic researchers, lets companies build custom reasoning models with 80% less computing power. Discover how it works and what its advantages are.