Multi-agent AI systems promise to solve problems that single models can’t handle alone, but their reliance on sequential text generation creates bottlenecks that inflate costs and slow down performance. Researchers from the University of Illinois Urbana-Champaign and Stanford University have introduced RecursiveMAS, a framework that replaces text-based agent communication with continuous latent embeddings, delivering a 2.4x speedup and a 75% reduction in token usage without sacrificing accuracy.
Why multi-agent systems struggle with scalability
Large language models excel at single-turn tasks, but complex workflows often require multiple specialized agents working in tandem. A common approach is prompt-based adaptation, where a central system refines the shared context to guide agent interactions. While this improves short-term alignment, the underlying model weights remain static, limiting long-term adaptability.
Training multi-agent systems end-to-end is technically possible but computationally prohibitive. Updating all parameters across interconnected models introduces stability risks and demands massive computational resources. Even if teams invest in full fine-tuning, the communication bottleneck persists: agents must wait for text outputs from predecessors before proceeding, slowing inference to a crawl.
Generating intermediate reasoning as text is inefficient for two reasons. First, it forces models to tokenize every step, inflating compute costs and latency. Second, the next agent must reinterpret that text, introducing noise and redundancy. These inefficiencies make iterative learning across the system painfully slow and expensive to scale.
How RecursiveMAS reframes agent collaboration
Instead of optimizing agents in isolation, RecursiveMAS treats the entire multi-agent system as a single recursive framework. The approach draws inspiration from recursive language models, which reuse a fixed set of layers to process data in loops rather than linearly. By allowing models to refine their hidden states across iterations without adding parameters, these systems achieve deeper reasoning with fewer resources.
RecursiveMAS extends this principle to multi-agent architectures by replacing text exchanges with continuous latent representations. Each agent functions like a layer in a recursive model, passing its embeddings to the next agent in sequence. The final agent’s output is fed back to the first, creating a closed loop that enables iterative refinement without generating intermediate text.
This latent collaboration allows agents to communicate "telepathically"—exchanging high-dimensional reasoning directly—while only the final agent produces a textual response. The result is faster inference, lower token costs, and a system that co-evolves as a cohesive unit rather than a collection of siloed models.
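The closed latent loop described above can be sketched in a few lines. This is a toy numpy illustration, not the paper's implementation: each "agent" is a stand-in function (in the real system, a frozen LLM forward pass), and the shapes, iteration counts, and agent count are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # latent dimensionality (illustrative)

# Hypothetical stand-ins for frozen agents: each maps a latent state to a
# refined latent state. In RecursiveMAS this would be an LLM forward pass.
def make_agent():
    W = rng.normal(scale=0.1, size=(D, D))
    return lambda h: np.tanh(h @ W)

agents = [make_agent() for _ in range(3)]

h = rng.normal(size=D)      # initial latent encoding of the task
for _ in range(2):          # closed-loop refinement iterations
    for agent in agents:    # embeddings pass agent-to-agent in sequence
        h = agent(h)        # no intermediate text is ever generated

# Only at this point would the final agent decode h into a textual answer.
print(h.shape)
```

The key property the sketch captures is that the loop body never tokenizes: state flows between agents as vectors, and text appears only once, at the end.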
Inside the RecursiveLink architecture
To enable latent-space collaboration, the researchers introduced RecursiveLink, a lightweight two-layer module that transmits and refines latent states between agents. Unlike full fine-tuning, which updates every parameter of the underlying language models, RecursiveMAS keeps model weights frozen and trains only the RecursiveLink modules.
The framework employs two types of RecursiveLinks: inner and outer. The inner variant operates within each agent, mapping newly generated embeddings back into the model’s input space. This allows agents to generate a continuous stream of latent thoughts without tokenizing intermediate steps.
The outer RecursiveLink acts as a bridge between agents, accommodating differences in model architectures and embedding dimensions. It includes an additional layer to align embeddings from one agent’s hidden space with another’s, ensuring seamless communication even when models vary in size or design.
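The two link types can be sketched as small MLPs. The source only says the module is "lightweight two-layer" and that the outer variant adds an alignment layer, so the hidden widths, activation choice, and dimensions below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def two_layer(d_in, d_hidden, d_out):
    """A minimal two-layer ReLU MLP, standing in for a RecursiveLink."""
    W1 = rng.normal(scale=0.1, size=(d_in, d_hidden))
    W2 = rng.normal(scale=0.1, size=(d_hidden, d_out))
    return lambda x: np.maximum(x @ W1, 0.0) @ W2

# Inner link: maps an agent's newly generated embedding back into that
# same agent's input space, so it can keep "thinking" in latent space.
d_a = 16  # agent A's hidden size (illustrative)
inner_link = two_layer(d_a, 32, d_a)

# Outer link: bridges two agents with different hidden sizes; the extra
# projection aligns agent A's embedding space with agent B's.
d_b = 24  # agent B's hidden size (illustrative)
align = rng.normal(scale=0.1, size=(d_a, d_b))  # assumed alignment layer
outer_core = two_layer(d_b, 32, d_b)
outer_link = lambda h: outer_core(h @ align)

h_a = rng.normal(size=d_a)
print(inner_link(h_a).shape)  # stays in agent A's space
print(outer_link(h_a).shape)  # mapped into agent B's space
```

Because these modules are the only trainable parameters, the number of weights being updated is tiny compared with the frozen models they connect.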
During training, the inner links are first trained independently to warm up each agent’s latent reasoning capabilities. Once stabilized, the outer links are introduced to facilitate inter-agent collaboration. This staged approach reduces instability and accelerates convergence, making the system practical for real-world deployment.
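The staged schedule can be illustrated with a toy optimization problem. Everything here is a stand-in: the "base" weights are a frozen random matrix, the links are single matrices rather than two-layer modules, and the step counts and learning rate are arbitrary. The point is only the structure, namely that phase one updates inner links alone and phase two brings in the outer links while the base stays frozen:

```python
import numpy as np

rng = np.random.default_rng(2)
D = 8

base_W = rng.normal(size=(D, D))  # frozen throughout, like the LLM weights
inner_W = np.zeros((D, D))        # trainable in phase 1 (warm-up)
outer_W = np.zeros((D, D))        # trainable only from phase 2 onward

def loss_and_grad(W, x, target):
    """Squared error of x @ W against target, with its gradient in W."""
    err = x @ W - target
    return (err ** 2).mean(), 2 * np.outer(x, err) / err.size

x = rng.normal(size=D)
target = x @ base_W  # toy objective: links learn to mimic the frozen base

# Phase 1: warm up the inner link alone so latent reasoning stabilizes.
for _ in range(200):
    _, g = loss_and_grad(inner_W, x, target)
    inner_W -= 0.1 * g

# Phase 2: with the inner link stabilized, introduce the outer link.
for _ in range(200):
    l, g = loss_and_grad(outer_W, x, target)
    outer_W -= 0.1 * g

print(l)
```

Separating the phases means phase two starts from a system whose per-agent loops already behave sensibly, which is where the claimed stability and convergence benefits come from.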
Benchmark gains and cost efficiency
Early experiments show RecursiveMAS delivers measurable improvements across domains where multi-agent systems traditionally struggle. In code generation tasks, the framework achieved higher functional correctness scores while reducing inference time. Medical reasoning benchmarks revealed improved diagnostic accuracy, and search-based evaluations demonstrated faster query resolution with fewer tokens.
The system’s efficiency extends to training as well. RecursiveMAS avoids the computational overhead of full fine-tuning or LoRA, instead focusing on lightweight modules. This makes it significantly cheaper to scale, whether for custom multi-agent deployments or research prototypes. For organizations building agentic systems, the framework offers a path to faster iteration, lower costs, and more reliable performance.
Looking ahead, the team behind RecursiveMAS plans to expand testing to additional domains, including robotics and multi-modal reasoning. The framework’s latent collaboration model could also inspire new approaches to agent training, where systems learn to evolve collectively rather than in isolation. As AI workloads grow in complexity, tools that prioritize both performance and efficiency will define the next generation of scalable multi-agent systems.
AI summary
RecursiveMAS, which lets multi-agent AI systems communicate in latent representation space, delivers a 2.4x inference speedup while cutting token usage by 75%.


