TinySearch helps small LLMs research the web without bloated context

Local language models running on consumer hardware face a hidden bottleneck: web search. Tools that feed raw page text into a model flood it with navigation menus, cookie banners, and duplicate paragraphs, wasting tokens on noise that small models struggle to filter out. The result: agent workflows slow down, token costs rise, and reasoning quality drops.

To fix this, developer Marcell Muth recently released TinySearch, an open-source Model Context Protocol (MCP) tool that turns web searches into clean, source-grounded prompts. Instead of dumping entire pages into context, TinySearch selects only the most relevant chunks, ranks them, and packages them as structured evidence for the model. The goal isn’t to replace commercial search APIs; it’s to give small LLMs a lightweight way to research the web without drowning in context bloat.

How TinySearch cleans up web search for local LLMs

TinySearch operates as a middle layer between your query and the answering model. It doesn’t generate answers itself; instead, it prepares a focused prompt containing only the evidence needed for reasoning. The workflow runs in six steps, from search through reranking and crawling to a packaged prompt:

Search: Uses DuckDuckGo to fetch HTML results for the user’s query.
Rerank: Filters out irrelevant snippets and selects the most promising links.
Crawl: Extracts clean markdown content from selected pages using Crawl4AI.
Chunk and rerank again: Breaks content into digestible pieces and ranks them by relevance.
Deduplicate and enforce quotas: Removes duplicates and caps the number of chunks per source.
Package as a prompt: Returns a structured response with the question, date, instructions, sources, and top-ranked chunks.

The final output is a compact, time-stamped prompt that tells the model exactly what evidence to use, where it came from, and when the search was performed. This helps reduce hallucinations on time-sensitive queries and keeps context windows lean.
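To make the packaging step concrete, here is a minimal sketch of how such a prompt could be assembled. The `Source` and `Chunk` types, the `build_prompt` function, and the exact prompt layout are assumptions for illustration; the article does not show TinySearch's actual output format.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Source:
    title: str
    url: str


@dataclass
class Chunk:
    source_id: int  # 1-based index into the sources list
    text: str
    score: float    # relevance score from the reranking step


def build_prompt(question: str, sources: List[Source], chunks: List[Chunk],
                 today: Optional[date] = None) -> str:
    """Assemble a compact, dated, source-grounded prompt (hypothetical layout)."""
    today = today or date.today()
    lines = [
        f"Question: {question}",
        f"Search date: {today.isoformat()}",
        "Instructions: Answer using only the evidence below and cite sources as [n].",
        "",
        "Sources:",
    ]
    for i, src in enumerate(sources, start=1):
        lines.append(f"[{i}] {src.title} - {src.url}")
    lines += ["", "Evidence:"]
    # Highest-scoring chunks first, each tagged with its source number.
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        lines.append(f"[{chunk.source_id}] {chunk.text}")
    return "\n".join(lines)


example_sources = [Source("Example Docs", "https://example.com/docs")]
example_chunks = [
    Chunk(1, "Low relevance text.", 0.2),
    Chunk(1, "High relevance text.", 0.9),
]
print(build_prompt("What is X?", example_sources, example_chunks, today=date(2025, 1, 1)))
```

Passing the date explicitly keeps the example reproducible; in practice the search date would be the day the query ran.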

Why small models need this kind of tool

For developers running 4B- or 9B-parameter models at home, context window size is a precious resource. Every unnecessary token dilutes the signal the model can use. When a tool injects entire pages, including SEO filler, broken markdown, and duplicated paragraphs, the model has to spend its limited context and attention on noise before it can even begin reasoning.

TinySearch targets three key pain points:

Token efficiency: Only relevant chunks are included, reducing wasted context.
Clarity for small models: Clean, ranked snippets improve reasoning quality.
Lightweight integration: Works as a standalone MCP tool or via Docker, without requiring a full search stack.

Unlike commercial tools that optimize for scale and speed, TinySearch focuses on the “annoying middle ground”—local agent workflows where developers want web research but don’t need enterprise-grade infrastructure.

Setting up TinySearch: Docker, Glama, or self-hosted

Adopting TinySearch doesn’t require deep infrastructure knowledge. The easiest path is through Glama, a platform that offers TinySearch as a pre-configured option. Users can plug it directly into MCP-compatible workflows without hosting anything locally.

For those who prefer self-hosting, TinySearch provides a ready-to-run Docker image. Running it takes a single command:

docker run --rm -p 8000:8000 \
  -e MCP_TRANSPORT=streamable-http \
  -e MCP_HOST=0.0.0.0 \
  marcellm01/tinysearch:latest

After starting the container, connect your MCP client using this configuration:

{
  "mcpServers": {
    "tinysearch": {
      "url": "
    }
  }
}

TinySearch exposes a single tool: research(query). Pass the user’s question directly, and the tool handles the rest—crawling, chunking, reranking, and packaging the evidence. An optional FastAPI server is also available for HTTP-based integration.
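As a rough illustration, the sketch below calls the `research` tool from Python using the official MCP Python SDK (`pip install mcp`). It assumes the tool takes a single `query` argument, as described above. The `TINYSEARCH_MCP_URL` environment variable is a hypothetical name, so point it at whatever endpoint your server actually exposes. SDK entry points can also differ between versions, so treat this as a starting point and not as TinySearch's documented client.

```python
import asyncio
import os

# Hypothetical variable name: set it to the MCP endpoint of your running server.
SERVER_URL = os.environ.get("TINYSEARCH_MCP_URL", "")


async def research(query: str, url: str = SERVER_URL) -> str:
    """Call the `research` tool over streamable HTTP and return its text output."""
    if not url:
        raise ValueError("Set TINYSEARCH_MCP_URL to the server's MCP endpoint")

    # Imported lazily so this module still loads without the MCP SDK installed.
    from mcp import ClientSession
    from mcp.client.streamable_http import streamablehttp_client

    async with streamablehttp_client(url) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool("research", {"query": query})
            # Keep only the text items from the tool result.
            return "\n".join(c.text for c in result.content if hasattr(c, "text"))


# Example (requires a running server):
# print(asyncio.run(research("What changed in the latest Python release?")))
```

The returned string is the prepared prompt, which you would then pass to your local model as context.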

Under the hood: The search pipeline explained

The tool’s current pipeline combines proven open-source components with custom optimizations for small-model workflows. Here’s how it works step-by-step:

Query submission: The user’s question is fed into the system.
HTML search: DuckDuckGo returns raw HTML snippets for the query.
Result reranking: A lightweight reranker filters out irrelevant or low-quality links.
Page crawling: Selected pages are scraped for clean markdown using Crawl4AI, which avoids common pitfalls like navigation noise and broken formatting.
Chunking and reranking: Extracted content is split into semantic chunks, then reranked using local or API-based embeddings.
Deduplication and quotas: Duplicate chunks are removed, and the number of chunks per source is capped so that no single source dominates the evidence.
Prompt assembly: The final output is a structured prompt containing the question, date, instructions, source titles, URLs, search previews, and the top-ranked chunks.

This pipeline prioritizes precision over coverage. The goal isn’t to return every possible result—it’s to return the fewest, most relevant chunks that a small model can actually use.
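The deduplication and quota step can be sketched in a few lines. This is a simplified illustration under stated assumptions, not TinySearch's implementation: duplicates are detected by exact match after whitespace and case normalization, and the `select_chunks` function and its `max_per_source` and `max_total` parameters are hypothetical names.

```python
import re
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def select_chunks(ranked, max_per_source=2, max_total=5):
    """Filter (source_url, text) pairs that are already sorted best-first."""
    seen = set()
    per_source = Counter()
    selected = []
    for url, text in ranked:
        key = normalize(text)
        if key in seen:
            continue  # duplicate of a chunk we already kept
        if per_source[url] >= max_per_source:
            continue  # this source has already used its quota
        seen.add(key)
        per_source[url] += 1
        selected.append((url, text))
        if len(selected) == max_total:
            break
    return selected


ranked = [
    ("https://a.example", "Foo bar"),
    ("https://a.example", "foo  BAR"),   # duplicate after normalization
    ("https://a.example", "Second point"),
    ("https://a.example", "Third point"),  # dropped: source quota reached
    ("https://b.example", "Another view"),
]
for url, text in select_chunks(ranked):
    print(url, "|", text)
```

Because the input is processed best-first, the quota always keeps each source's highest-ranked chunks and drops the rest.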

Embeddings and customization: Start simple, tune later

TinySearch supports multiple embedding backends to balance speed and accuracy. By default, it offers three local presets using ONNX models:

fast: Uses all-MiniLM-L6-v2 for quick but basic relevance scoring.
balanced: Uses bge-small-en-v1.5 for a middle ground between speed and accuracy.
quality: Uses bge-base-en-v1.5 for higher precision at the cost of speed.

For teams already using vector databases or cloud APIs, TinySearch can also connect to OpenAI-compatible embedding endpoints. Configuration options include search depth, rerank weights, chunk limits, crawl concurrency, and tokenizer settings—allowing fine-tuning without rewriting the core logic.
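Embedding-based relevance scoring usually comes down to comparing a query vector with each chunk vector. The sketch below uses cosine similarity with tiny hand-written vectors standing in for real embeddings. The article does not say which similarity measure TinySearch uses or how its rerank weights combine scores, so the `cosine` and `rerank` functions here are illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rerank(query_vec, chunk_vecs, top_k=3):
    """Return (chunk_index, score) pairs for the top_k chunks, best first."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 2-D "embeddings"; a real backend would return vectors with hundreds of dimensions.
query = [1.0, 0.0]
chunks = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
for index, score in rerank(query, chunks, top_k=2):
    print(index, round(score, 3))
```

Swapping embedding presets changes only how the vectors are produced; the ranking logic stays the same, which is why a faster or higher-quality model can be substituted without touching the rest of the pipeline.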

What TinySearch isn’t—and why that’s intentional

TinySearch isn’t designed to replace Perplexity, Exa, or Tavily. It doesn’t build long-term indexes, guarantee perfect coverage, or scale to enterprise workloads. Instead, it fills a narrow niche: giving small, local LLMs a way to research the web without getting buried in context.

This approach reflects a broader trend in local AI development. Many tools prioritize features over usability, adding complexity that small models can’t handle. TinySearch strips away the bloat, focusing on one thing: returning useful research context, not a landfill of raw page text.

The future of lightweight web search for LLMs

As local LLMs grow more capable, the tools around them need to evolve too. Developers aren’t waiting for bigger models to solve context problems—they’re building smarter pipelines that respect hardware limits. TinySearch is one such tool, offering a glimpse of what lightweight, source-grounded research could look like in agent workflows.

For now, it’s a niche solution—but for developers running small models at home, that niche is significant. The next step might be tighter integration with RAG frameworks or support for federated search across personal knowledge bases. Whatever comes next, the principle remains the same: keep the context clean, keep the model focused, and let the reasoning begin.

