Teams that hardcode LangChain pipelines soon hit a wall when user queries change—an inevitability in real-world apps. Sakana AI tackles this rigidity with RL Conductor, a compact reinforcement-learning model that dynamically orchestrates top-tier LLMs to handle shifting workloads efficiently.
The cost of static agent pipelines
LangChain, Mixture-of-Agents, and similar frameworks work well in controlled demos, but crumble under diversity. When applications serve millions with unpredictable needs, hard-coded workflows become bottlenecks. Yujin Tang, co-author of the RL Conductor paper and Sakana AI researcher, told VentureBeat: “Human-designed pipelines excel in narrow scenarios, yet they fail to generalize across heterogeneous user demands that dominate production systems.”
A second flaw lies in model specialization. No single LLM dominates every task: one excels at scientific reasoning, another at code generation, a third at high-level planning. Statically pairing models to queries is impractical, especially as user needs evolve. Tang emphasized that “optimal orchestration must adapt to each input rather than lean on fixed assignments.”
Orchestrating agents like a conductor
RL Conductor replaces rigid code with fluid workflows. Instead of routing rules, it writes natural-language instructions for each subtask, assigns the most suitable agent, and builds a context “access list” to share relevant prior outputs. This lets it craft workflows on the fly—linear chains, parallel trees, or recursive loops—tailored to the problem at hand.
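To make the idea concrete, here is a minimal sketch of what such a dynamically built workflow step might look like. The class and function names (`WorkflowStep`, `render_context`) and the agent labels are hypothetical illustrations, not Sakana AI's actual API; the paper describes the mechanism in terms of natural-language instructions, per-step agent assignment, and a context access list, which is all this sketch models.

```python
from dataclasses import dataclass, field

@dataclass
class WorkflowStep:
    """One step the orchestrator emits: a natural-language instruction,
    the agent chosen for it, and which prior outputs it may read."""
    instruction: str                                      # free-form subtask description
    agent: str                                            # worker model assigned to this step
    access_list: list[int] = field(default_factory=list)  # indices of prior steps whose outputs are shared

def render_context(outputs: list[str], current: WorkflowStep) -> str:
    """Assemble the prompt for `current` from only the outputs it is allowed to see."""
    visible = [f"[step {i} output]\n{outputs[i]}" for i in current.access_list]
    return "\n\n".join(visible + [current.instruction])

# Example: a two-step linear chain -- step 1 is granted access to step 0's output.
steps = [
    WorkflowStep("Outline a proof strategy for the problem.", "gpt-5"),
    WorkflowStep("Write the full solution using the strategy.", "claude-sonnet-4",
                 access_list=[0]),
]
outputs = ["Use induction on n."]          # pretend step 0 already ran
prompt = render_context(outputs, steps[1]) # step 0's output is prepended to step 1's instruction
```

Because the access list is data rather than code, the same machinery expresses linear chains, parallel branches (disjoint access lists), or iterative refinement (a step re-reading its own earlier output).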
The model learns not from human blueprints but through reinforcement. During training, it receives a task, a pool of seven diverse workers (Gemini 2.5 Pro, Claude Sonnet 4, GPT-5, plus four open-source models), and a reward signal for correct answers. Through trial-and-error, it discovers which instruction combinations and communication topologies maximize reward, adopting advanced tactics like targeted prompt engineering and iterative refinement.
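The training signal described above can be illustrated with a toy REINFORCE-style loop. Everything here is a deliberately simplified stand-in: the "environment" rewards one hidden best agent per task type instead of scoring a real multi-step workflow, and the policy is a small softmax table rather than a fine-tuned LLM. What it does share with the paper's setup is the core mechanic: sample an orchestration choice, observe a correctness reward, and shift probability toward choices that earned reward.

```python
import math
import random

random.seed(0)

AGENTS = ["gemini-2.5-pro", "claude-sonnet-4", "gpt-5",
          "open-a", "open-b", "open-c", "open-d"]  # seven-worker pool, as in the paper

def reward(task_type: int, agent_idx: int) -> float:
    """Toy environment: reward 1 when the sampled agent is the (hidden)
    best one for this task type, else 0. The real system instead scores
    the final answer produced by the whole workflow."""
    best_for = {0: 2, 1: 0}  # assumed mapping for illustration only
    return 1.0 if agent_idx == best_for[task_type] else 0.0

# Tabular softmax policy: one logit per (task_type, agent) pair.
logits = [[0.0] * len(AGENTS) for _ in range(2)]

def sample(task_type: int) -> int:
    """Sample an agent index from the softmax over this task type's logits."""
    exps = [math.exp(l) for l in logits[task_type]]
    z = sum(exps)
    r, acc = random.random() * z, 0.0
    for i, e in enumerate(exps):
        acc += e
        if r <= acc:
            return i
    return len(AGENTS) - 1

LR = 0.5
for _ in range(2000):
    t = random.randrange(2)          # draw a task
    a = sample(t)                    # policy picks an agent
    r = reward(t, a)                 # correctness signal
    exps = [math.exp(l) for l in logits[t]]
    z = sum(exps)
    for i in range(len(AGENTS)):
        # REINFORCE: reward-weighted gradient of log-probability
        grad = (1.0 if i == a else 0.0) - exps[i] / z
        logits[t][i] += LR * r * grad
```

After training, the policy concentrates probability on the best agent for each task type, which is the trial-and-error dynamic the article describes, scaled down to a table instead of a 7B-parameter model.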
Benchmark-beating with fewer tokens
To validate the approach, researchers fine-tuned Qwen2.5-7B, a 7-billion-parameter model, as the Conductor. They tasked it with designing workflows of up to five steps using the seven-agent pool. On challenging benchmarks, the Conductor surpassed individual frontier models and state-of-the-art multi-agent routers:
- Average score across all tasks: 77.27%
- AIME25 math benchmark: 93.3%
- GPQA-Diamond reasoning: 87.5%
- LiveCodeBench coding: 83.93%
Crucially, the efficiency gains were dramatic. While baselines like Mixture-of-Agents consumed 11,203 tokens per query, the Conductor averaged just 1,820 tokens and completed tasks in roughly three steps. Its depth also adapts to complexity: simple queries resolve in one or two steps, whereas tough coding problems trigger four-step workflows with dedicated planning, implementation, and verification phases.
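The headline savings follow directly from the two reported averages; a quick check of the arithmetic:

```python
# Per-query token averages reported in the article.
baseline_tokens = 11_203   # Mixture-of-Agents
conductor_tokens = 1_820   # RL Conductor

savings = baseline_tokens / conductor_tokens  # roughly a 6x reduction
```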
What’s next for AI agent orchestration
RL Conductor proves that small, RL-trained models can outperform expensive, human-engineered pipelines on reasoning-heavy workloads. By shifting orchestration from static code to dynamic learning, Sakana AI offers a path to production systems that scale without constant manual tuning. Expect future work to extend these principles to multimodal tasks and real-time collaboration across agents.
