Why 'Free' Local AI Executors Can Cost More Than Cloud Models

When the phrase “use a strong model to orchestrate and a cheap model to execute” became a cost-saving mantra for agentic coding, many developers took it at face value. The logic seemed sound: local executors like Qwen 3.5-9B run for free, so pairing them with a premium orchestrator like Opus 4.7 should slash cloud bills. But real-world testing tells a different story.

In 40 structured trials involving three code-repair tasks, the supposedly “free” Opus + Qwen setup emerged as the most expensive option across the board. It cost more than running Opus alone, more than Opus paired with Haiku, and far more than Haiku running solo. The results, published in a peer-reviewed study on Zenodo, expose a hidden paradox: the orchestrator’s repeated re-reading of the executor’s outputs can inflate cloud costs beyond the savings from zero-token execution.

The Setup: Four Configurations, 40 Trials, One Clear Winner

Each trial tested a different agentic coding strategy on the same codebase—the Typer repository at commit b210c0e—using a deterministic judging harness built on mypy, ruff, and pytest. The four configurations were:

Opus 4.7 solo — A single premium model handles planning, editing, and verification.
Opus 4.7 + Qwen 3.5-9B (local) — Opus orchestrates and verifies; Qwen edits code locally using Ollama, with zero token cost.
Opus 4.7 + Haiku 4.5 (cloud) — Opus orchestrates; Haiku edits via a sub-loop in the Anthropic SDK.
Haiku 4.5 solo — A single low-cost cloud model performs all tasks.

All models used the same toolset: a str_replace_editor for file operations and a bash tool with a 120-second timeout. Orchestrator models (Opus + Qwen/Haiku) had access to a delegate_to_executor function to offload edits. Prompt caching was enabled uniformly using Anthropic’s SDK, with the system prompt, tool definitions, and most recent user message marked for ephemeral caching.

The Three Tasks: From Breakage Recovery to Feature Addition

Each configuration faced three distinct challenges designed to test real-world coding scenarios:

T1 — Breakage Recovery: 25 synthetic errors were injected via AST manipulation (10 mypy errors, 10 ruff warnings, 5 pytest collection failures). The agent had to restore all tests to a passing state.
T2 — Refactor: The function get_params_from_function was moved from typer/utils.py to a new module, typer/_param_extractor.py, with all import statements updated and tests still passing.
T3 — Feature Add: A new function, get_version_banner(prefix, uppercase), was implemented, exported from typer/__init__.py, and verified using a SHA-256-fingerprinted test file.

Success was defined as a clean exit code from the verification harness—no subjective LLM-as-judge evaluations were used.

The Surprising Results: Zero Cost Didn’t Mean Zero Expense

Across 40 trials, total cloud spend reached $35.98—modest for a research project, but eye-opening for developers expecting savings. The Opus + Qwen configuration stood out as the most expensive option on every task:

T1: Opus + Qwen cost $2.27; Opus solo cost $1.74.
T2: Opus + Qwen cost $1.38; Opus solo cost $1.11.
T3: Opus + Qwen cost $0.42; Opus solo cost $0.17.

Even more striking, the Opus + Qwen setup consumed 1.4 to 5.3 times more input tokens on the Opus side than Opus running alone—despite Qwen’s tokens being free. The reason? Opus repeatedly re-read the executor’s output summaries due to Anthropic’s prompt caching mechanism. Each re-read triggered a cache read operation billed at 10% of input token rates, and over dozens of turns, those reads added up.

The Hidden Cost Driver: Prompt Cache Re-Reading

Token consumption on the Opus side revealed the full picture:

T1: Opus + Qwen read 733,142 tokens (Opus-side input + cache reads) versus 534,586 for Opus solo—a 1.38× increase.
T2: 313,914 tokens versus 226,474—a 1.39× increase.
T3: 62,864 tokens versus 13,320—a 5.26× increase.

The mechanism is simple but counterintuitive. When Opus delegates a task to the executor, Qwen returns a summary of its changes. That summary is cached by Anthropic’s SDK and then re-read by Opus in subsequent turns—each re-read billed as a cache read. Over 30 to 80 turns, those repeated reads dwarfed the savings from zero-token execution.

This phenomenon highlights a critical insight: the orchestrator’s cost scales with how often it re-reads the executor’s outputs, not with the executor’s raw token usage. The phrase “free executor” obscures the fact that the orchestrator’s context can balloon when it repeatedly ingests summaries.

Practical Takeaways: When Local Executors Make Sense

Despite the surprising outcome, not all local executor setups are doomed to overspend. The Opus + Haiku configuration balanced cost and reliability, outperforming Opus solo in wall clock time and cost on T2 and T3 while maintaining a 100% success rate. Haiku solo, while the cheapest overall, failed 25% of the time—highlighting the trade-off between cost and stability.

For developers considering local executors, the key takeaway is clear: don’t assume that zero token cost equals lower cloud expense. Measure token consumption on the orchestrator’s side, especially when using prompt caching. If the orchestrator re-reads executor outputs frequently, the savings may vanish—and in some cases, costs may rise.

Local AI execution isn’t inherently more expensive, but it’s not inherently cheaper either. The real cost driver is the orchestration overhead, which grows with every re-read of the executor’s work. Plan your architecture accordingly.

AI summary

Yerel modellerin token maliyeti sıfır olsa da, bulut tabanlı planlama modellerinin prompt önbellek okumaları nedeniyle toplam maliyet artabiliyor. Opus 4.7 + Qwen 3.5-9B kombinasyonunun neden en pahalı seçenek olduğunu araştırdık.

Why 'Free' Local AI Executors Can Cost More Than Cloud Models

The Setup: Four Configurations, 40 Trials, One Clear Winner

The Three Tasks: From Breakage Recovery to Feature Addition

The Surprising Results: Zero Cost Didn’t Mean Zero Expense

The Hidden Cost Driver: Prompt Cache Re-Reading

Practical Takeaways: When Local Executors Make Sense

Comments

Why your messy codebase makes AI tools stumble

How to Eliminate Static AWS Keys for Safer Cloud Deployments

How APIs are reshaping Africa’s $10B+ digital payments market