How a Three-Stage Pipeline Powers the Mano-P GUI Agent

Training an AI to use a computer isn’t just about clicking buttons—it’s about navigating millions of pixels, interpreting unfamiliar interfaces, and making split-second decisions. Mano-P, a vision-language-action agent designed to operate on edge devices like laptops, tackles this challenge with a three-stage training pipeline that transforms raw imitation into adaptive problem-solving.

The 4-billion-parameter model runs efficiently on an Apple M5 Pro chip, decoding at approximately 80 tokens per second, and leads the OSWorld benchmark with 58.2% accuracy, 13.2 percentage points ahead of the next-best system. But the real breakthrough lies in how it was trained, not just what it achieved.

The Foundation: Learning from Demonstrations

Before Mano-P could reason about actions or recover from errors, it needed to understand the basics of computer interfaces. Enter supervised fine-tuning (SFT), the first stage where the model learned by mimicking human experts.

Training data consisted of annotated GUI interaction traces, each containing a screenshot, the user’s thought process, and the corresponding action—click, type, scroll, or navigate. The model absorbed patterns like:

Identifying UI elements directly from pixel data, such as buttons, text fields, and menus, without relying on DOM trees or accessibility APIs that may not exist in all applications.
Mapping actions to screen coordinates, including multi-point gestures like drag-and-drop, and issuing keyboard shortcuts where no coordinate is involved.
Breaking down high-level tasks into manageable steps, such as transforming "open settings and change the wallpaper" into a sequence of discrete actions.
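
To make the shape of this training data concrete, here is a minimal sketch of how one annotated step might be stored and turned into a supervised example. The field names, the file paths, the coordinates, and the "Thought / Action" target format are all assumptions made for illustration; the article does not describe Mano-P's actual schema.

```python
from dataclasses import dataclass
from typing import Literal, Optional, Tuple

# The four action kinds named in the article.
ActionType = Literal["click", "type", "scroll", "navigate"]


@dataclass(frozen=True)
class TraceStep:
    """One annotated step: the screen, the expert's reasoning, and the action."""
    screenshot_path: str                   # hypothetical path to the captured frame
    thought: str                           # annotated reasoning for this step
    action: ActionType
    xy: Optional[Tuple[int, int]] = None   # pixel coordinates for pointer actions
    text: Optional[str] = None             # payload for "type" actions


def to_sft_example(task: str, step: TraceStep) -> dict:
    """Turn a step into an (image, prompt, target) record for fine-tuning."""
    target = f"Thought: {step.thought}\nAction: {step.action}"
    if step.xy is not None:
        target += f"({step.xy[0]}, {step.xy[1]})"
    if step.text is not None:
        target += f' "{step.text}"'
    return {"image": step.screenshot_path, "prompt": f"Task: {task}", "target": target}


# A high-level task broken into discrete steps (all values are made up).
trace = [
    TraceStep("frames/000.png", "Settings is in the dock.", "click", xy=(412, 1058)),
    TraceStep("frames/001.png", "Wallpaper sits in the sidebar.", "click", xy=(96, 540)),
]
examples = [to_sft_example("open settings and change the wallpaper", s) for s in trace]
```

Each record pairs a single screenshot with the text the model is expected to produce, which is what lets one high-level instruction expand into several training examples.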

The result was a model that could reliably perform familiar tasks but struggled with novel interfaces or multi-step workflows with branching logic. It had memorized solutions rather than developed adaptable strategies—a deliberate outcome for this stage.

Refining Decisions Without Live Risks

Offline reinforcement learning (RL) marked the second stage, where the model learned from a curated dataset of pre-collected trajectories—both successful and failed attempts. This approach avoided the pitfalls of early-stage exploration, where an untrained model might repeatedly trigger system dialogs or navigate into dead ends.

The dataset included:

Expert demonstrations, representing high-reward paths to task completion.
The model’s own prior rollouts, capturing mixed outcomes from earlier training phases.
Trajectories from intermediate checkpoints, providing varied quality signals.
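
As a rough illustration of such a mixed dataset, the sketch below tags each trajectory with where it came from and whether it succeeded, then summarizes the mix. The source labels and the toy trajectories are assumptions for illustration, not the real data format.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Trajectory:
    source: str          # assumed labels: "expert", "self_rollout", "checkpoint"
    actions: List[str]
    success: bool


def summarize(dataset: List[Trajectory]) -> Dict[str, Tuple[int, int]]:
    """Return (successes, total) per source so the quality mix is visible."""
    totals = Counter(t.source for t in dataset)
    wins = Counter(t.source for t in dataset if t.success)
    return {src: (wins[src], totals[src]) for src in totals}


dataset = [
    Trajectory("expert", ["click", "type"], True),
    Trajectory("self_rollout", ["click", "click", "scroll"], False),
    Trajectory("self_rollout", ["scroll", "click"], True),
    Trajectory("checkpoint", ["click"], False),
]
stats = summarize(dataset)
```

Keeping the failed trajectories in the dataset, rather than filtering them out, is what gives the second stage something to learn from beyond imitation.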

By analyzing this data, the model learned to extract meaningful patterns from suboptimal examples. For instance, it recognized that clicking a certain button in a specific context often led to dead ends, while scrolling up in a list usually improved navigation. Key improvements from this stage included:

Error recovery: Recognizing when an action failed and determining the next best step.
Alternative strategies: Developing multiple pathways to achieve the same goal, prioritizing more reliable routes.
State assessment: Evaluating progress in real time to distinguish productive actions from wasted effort.
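
The article does not name the offline RL algorithm used, so the following is only a toy stand-in for the general idea: estimate how well each action tends to work in a given context by averaging outcomes over logged steps, with no live interaction. The contexts, action names, and outcomes are invented for the example.

```python
from collections import defaultdict

# Each logged step: (context, action, did the episode eventually succeed?).
logged_steps = [
    ("file_list", "scroll_up", True),
    ("file_list", "scroll_up", True),
    ("file_list", "click_banner", False),
    ("file_list", "click_banner", False),
    ("file_list", "scroll_up", False),
]


def estimate_action_values(steps):
    """Average episode outcome for every (context, action) pair in the logs."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for context, action, succeeded in steps:
        sums[(context, action)] += 1.0 if succeeded else 0.0
        counts[(context, action)] += 1
    return {key: sums[key] / counts[key] for key in counts}


values = estimate_action_values(logged_steps)

# Pick the action with the best logged success rate in one context.
best = max(
    (action for (context, action) in values if context == "file_list"),
    key=lambda action: values[("file_list", action)],
)
```

Here the banner click never led to success in the logs while scrolling up usually did, so the estimate steers away from the dead end. A real offline RL method would also have to handle actions that appear rarely or never in the data, which this average ignores.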

Offline RL produced a significantly more robust model, capable of handling unfamiliar layouts by learning from diverse trajectories—not just perfect paths. However, it remained constrained by static data, unable to adapt to live changes in applications or operating systems.

Adapting in Real Time to Unseen Challenges

The final stage brought Mano-P into live environments through online RL, where it interacted with real—or highly realistic simulated—systems and learned from direct feedback. By this point, the model was already competent at basic tasks and error recovery, making its exploration productive rather than random.

The online phase focused on three core objectives:

Direct interaction: Executing actions, observing results, and adjusting strategies in real time.
Reward optimization: Balancing task completion with efficiency, where fewer steps translated to higher rewards.
Handling distributional shifts: Encountering and adapting to states not present in training data, such as new app versions or unexpected dialog boxes.
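
The interaction loop behind these objectives can be sketched with a toy environment. Everything here is hypothetical: `ToyDialogEnv` is a stand-in for a real GUI, the 30% dialog probability is arbitrary, and the hand-written `policy` replaces the learned model. The point is only the act, observe, adjust cycle, including an unexpected dialog that was not part of the planned path.

```python
import random


class ToyDialogEnv:
    """Hypothetical GUI stand-in: a dialog may pop up and must be dismissed."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        self.dialog_open = False
        self.progress = 0
        return self.observe()

    def observe(self):
        return "dialog" if self.dialog_open else f"screen_{self.progress}"

    def step(self, action):
        if self.dialog_open:
            # Only "dismiss" has any effect while a dialog blocks the screen.
            if action == "dismiss":
                self.dialog_open = False
        elif action == "next":
            self.progress += 1
            # An unexpected dialog appears some of the time before the task ends.
            if self.progress < 3 and self.rng.random() < 0.3:
                self.dialog_open = True
        done = self.progress >= 3
        return self.observe(), done


def policy(observation):
    """Hand-written placeholder for the learned policy."""
    return "dismiss" if observation == "dialog" else "next"


def run_episode(env, policy, max_steps=20):
    """Act, observe the result, and choose the next action from the new state."""
    obs = env.reset()
    history = []
    for _ in range(max_steps):
        action = policy(obs)
        obs, done = env.step(action)
        history.append((action, obs))
        if done:
            return True, history
    return False, history


success, history = run_episode(ToyDialogEnv(seed=0), policy)
```

Because the policy reacts to what it observes rather than replaying a fixed script, the episode finishes whether or not a dialog interrupts it.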

The reward system combined binary completion metrics (did the task finish?) with step efficiency (fewer actions = better) and a verification component to ensure robustness. Pure completion rewards often led to brittle policies that technically succeeded but used inefficient or fragile methods. The efficiency component pushed the model toward direct, reliable solutions.
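
One way such a reward could be composed is sketched below. The article names the three ingredients but not how they are combined, so the gating on verification, the step budget, and the 0.3 efficiency weight are all assumptions chosen for illustration.

```python
def episode_reward(
    completed: bool,
    verified: bool,
    steps_used: int,
    step_budget: int,
    efficiency_weight: float = 0.3,   # assumed weight, not from the article
) -> float:
    """Combine completion, verification, and step efficiency into one score.

    Assumption: an episode earns nothing unless it both completed and passed
    verification; otherwise part of the reward scales with unused step budget.
    """
    if step_budget <= 0:
        raise ValueError("step_budget must be positive")
    if not (completed and verified):
        return 0.0
    # Fraction of the budget left unused, clipped so overruns do not go negative.
    saved = max(0.0, 1.0 - steps_used / step_budget)
    return (1.0 - efficiency_weight) + efficiency_weight * saved


fast = episode_reward(completed=True, verified=True, steps_used=5, step_budget=20)
slow = episode_reward(completed=True, verified=True, steps_used=20, step_budget=20)
unverified = episode_reward(completed=True, verified=False, steps_used=5, step_budget=20)
```

With these settings, a verified success in 5 of 20 steps scores higher than one that uses the whole budget, and a completion that fails verification scores zero, which is the pressure toward direct, reliable solutions described above.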

The Edge-Device Advantage

Mano-P’s three-stage pipeline isn’t just a technical achievement—it’s a practical one. By training offline first, the model avoids the instability and computational cost of live exploration during early phases. Then, by fine-tuning online, it adapts to real-world conditions without requiring massive data centers.

The result is an agent that doesn’t just perform tasks—it learns from them, evolves with its environment, and remains efficient even on consumer-grade hardware.

