iToverDose/Software· 23 MAY 2026 · 08:00

How a DIY AI gateway slashed my cloud costs by routing tasks smartly

A weekend project turned a single AI endpoint into a cost-saving traffic cop, automatically sending simple requests to a free local model and reserving cloud LLMs only when absolutely needed.

DEV Community5 min read0 Comments

Last Saturday at dawn, my autonomous AI assistant quietly handled an impressive workload without a single prompt: it cross-checked 14 restaurant ratings in Bangalore’s Indiranagar, updated a shared spreadsheet I’d ignored for days, signed a 20-page PDF, and even generated a bash script to clean up server logs—all on its own initiative.

The assistant is named OpenClaw, a long-running agent living on a Raspberry Pi connected to Discord and running around the clock. It manages memory, researches topics, writes code, edits documents, and scouts weekend hangouts by scraping live ratings—essentially automating half of my daily routine on autopilot.

But a recurring frustration emerged. When I asked OpenClaw to write a Python script to parse JSON logs, the request was routed through a paid cloud API. The task could have been handled instantly by a local model running idle on my Mac Mini just three feet away. Moments later, when I posed a deeper reasoning question about event-driven versus polling architectures for a notification system, that cloud API was the right destination. The same agent. The same endpoint. Completely different needs for the same request stream.

This inconsistency sparked a simple but powerful idea: what if the system could decide which AI model to use before any token is spent?

It turned out the idea wasn’t silly at all. Over a single weekend, using a Raspberry Pi, a Mac Mini, 50 lines of Python, and an open-source gateway, I built a lightweight router that makes these decisions in real time.

Here’s how it works.

The Living Room Setup

My setup is entirely home-based, powered by off-the-shelf hardware and open tools:

  • Raspberry Pi: Runs OpenClaw, the autonomous agent that ingests Discord inputs, manages context and memory, and coordinates all tasks.
  • Mac Mini: Acts as the brain farm hosting three key components:
  • Ollama with the qwen2.5-coder:7b model, a local coding specialist that never leaves my network
  • AgentGateway, an open-source AI routing layer from Google
  • A custom Python router that classifies intent and injects routing headers

OpenClaw sends every request to a single endpoint—` lets the background system handle the rest. The agent itself remains unaware of the internal routing logic.

How the Architecture Routes Intelligently

Three distinct models. Three price tiers. One unified interface. OpenClaw never needs to know which model responds—it just hits the endpoint and moves on.

Why AgentGateway Stands Out

I evaluated several options—raw Envoy, Nginx with Lua scripting, and even a bespoke proxy—before settling on AgentGateway. Here’s what tipped the scales:

  • Protocol translation: It exposes an OpenAI-compatible API on the front end while seamlessly connecting to backends like Gemini, Vertex AI, Bedrock, and Ollama without requiring provider-specific code.
  • Backend authentication: API keys are managed centrally. OpenClaw never sees or stores any key, reducing exposure.
  • Model aliasing: OpenClaw sends model: "inteli-llm" in each request; AgentGateway silently translates it to qwen2.5-coder:7b, gpt-4o, or gemini-2.5-flash based on routing rules.
  • Observability: Every request logs provider, model, token usage, and latency, giving me full visibility into spending and performance.
  • Enterprise-grade features: Built-in PII masking, content moderation via webhooks, rate limiting, weighted load balancing, and automatic failover if a backend crashes.

That said, AgentGateway routes based on path, headers, and methods—it does not parse request bodies to decide destinations. That limitation became the catalyst for the next piece of the puzzle.

The 50-Line Router That Makes It Happen

I built a minimal FastAPI proxy that sits in front of AgentGateway. Its job is to classify intent before routing:

  • Intercepts each OpenAI-compatible request
  • Examines the last message in the chat history
  • Uses keyword matching and prompt length heuristics to decide intent:
  • Keywords like code, python, script, function, or bugcoding
  • Keywords like think, analyze, reasoning, or deduce, or prompts longer than 400 characters → reasoning
  • Anything else → simple
  • Injects an x-intent header (coding, reasoning, or simple)
  • Forwards the unchanged request to AgentGateway

No machine learning, no vector databases, no semantic similarity models—just a lightweight classifier that works reliably in 90% of cases, perfect for a homelab environment.

coding_keywords = ["code", "python", "javascript", "bash", "script", "function", "bug"]
reasoning_keywords = ["think", "analyze", "explain in detail", "reasoning", "logic", "deduce"]

if any(k in prompt_lower for k in coding_keywords):
    intent = "coding"
elif len(prompt) > 400 or any(k in prompt_lower for k in reasoning_keywords):
    intent = "reasoning"
else:
    intent = "simple"

Real-World Cost Savings in Practice

Here’s the breakdown of actual cost impact:

| Intent | Model | Location | Cost per 1M tokens | |------------|---------------------|------------------|--------------------| | Coding | qwen2.5-coder:7b | Local (Ollama) | $0 | | Simple Q&A | gemini-2.5-flash | Google Cloud | ~$0.15 | | Deep Reasoning | gpt-4o | OpenAI | ~$2.50 |

Before the router, every request went to a cloud API, regardless of complexity. Now, roughly 60–70% of queries stay local—handling coding tasks, quick lookups, and formatting jobs instantly and for free. The expensive reasoning model only gets invoked when genuinely needed, while mid-tier models cover everything in between.

The result? A noticeable drop in my monthly API bill and faster responses for most tasks.

Design Lessons That Translate Beyond the Lab

Header-based routing scales effortlessly. By pushing intent classification to a thin proxy layer, OpenClaw remains clean and decoupled from routing logic. This separation of concerns makes the system easier to test, extend, and maintain.

Local models are underutilized powerhouses. Many developers overlook idle local LLMs for trivial tasks—only routing cloud-bound traffic when necessary can unlock substantial savings and privacy benefits.

Gateways should stay fast and simple. AgentGateway’s protocol-level routing keeps latency low, while the intent router adds negligible overhead. This hybrid architecture balances performance with functionality.

While content-aware routing isn’t in scope for gateways, a lightweight proxy layer can bridge the gap—proving that even modest engineering effort can yield outsized returns.

With this system now handling thousands of requests monthly, the next step is refining intent classification and expanding the roster of local models. The goal remains unchanged: keep automation flowing, keep costs predictable, and keep the AI assistant humming—24/7, from a living room in Bangalore.

AI summary

Kendi yaşam alanından sürekli çalışan otonom yapay zeka aracını geliştirin. Yerel ve bulut tabanlı modelleri birleştirerek maliyetinizi azaltın.

Comments

00
LEAVE A COMMENT
ID #7YX2KC

0 / 1200 CHARACTERS

Human check

7 + 3 = ?

Will appear after editor review

Moderation · Spam protection active

No approved comments yet. Be first.