Managing limited context windows in large language models (LLMs) resembles an old computing challenge: DOS’s 640KB memory ceiling. In the 1990s, tools like EMM386 used address-translation hardware to expose larger memory spaces through a small fixed window. Today, a similar approach is emerging for LLMs with LLM386, a runtime designed to dynamically manage context by paging only what’s relevant for the current operation.
The challenge of bounded context windows
LLMs operate within strict token limits—32K, 128K, or 1M tokens—while the data they need often exceeds these constraints. Conversation history, retrieved documents, tool results, and persistent facts can quickly overwhelm even the largest windows. The typical workaround—concatenating the last N messages plus vector search hits—fails when prompts grow too large to trace. Models produce answers without clear reasoning, and even minor changes can yield inconsistent outputs.
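To see why the workaround breaks down, here is a minimal sketch of the naive "last N messages plus retrieval hits" strategy (not LLM386 code; a crude whitespace count stands in for a real tokenizer):

```python
# Naive context assembly: concatenate recent history and retrieval hits,
# then truncate arbitrarily when the budget is exceeded. Nothing records
# what was dropped or why.

def count_tokens(text: str) -> int:
    return len(text.split())  # stand-in for a real tokenizer

def naive_context(history: list[str], hits: list[str], budget: int) -> str:
    prompt = "\n".join(history[-10:] + hits)  # last N messages + all hits
    if count_tokens(prompt) > budget:
        # Arbitrary truncation: the model never sees what was cut,
        # and the caller cannot trace it afterwards.
        prompt = " ".join(prompt.split()[:budget])
    return prompt

history = [f"turn {i}: " + "word " * 50 for i in range(20)]
hits = ["retrieved doc: " + "word " * 200 for _ in range(3)]
prompt = naive_context(history, hits, budget=512)
print(count_tokens(prompt))  # capped at 512; content silently lost
```

The prompt silently overflows and gets clipped, which is exactly the untraceable behavior described above.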
LLM386 addresses this by treating the model as a stateless function, where all continuity is reconstructed from a durable store managed by the runtime. This ensures reproducibility and transparency, two critical needs as AI agents grow more complex.
How the runtime works
LLM386 functions like a modern EMM386, providing a structured way to manage context within constrained windows. Its architecture includes several key components:
- Persistent block store: Uses LMDB for content-addressed, deduplicated storage, ensuring efficient retrieval and versioning.
- Pager: Dynamically selects blocks to fit the model’s input budget by running multiple retrievers in parallel—recency, BM25, embedding ANN, and custom options. Scores are normalized, merged, and allocated across canonical sections like System, Task, State, Plan, Retrieved, Tools, Recent, and Background.
- Packer: Converts the selected blocks into a deterministic prompt string or role-tagged chat message list.
- Tracer: Logs what the model sees and why, including byte-level prompt hashes for replayability.
- Reducer: Transforms model output back into committed state via parsed events.
- Typed-edge graph: Tracks dependencies between blocks, ensuring tool results remain paired with the messages that invoked them.
- Diff layer: Compares trace records turn-over-turn to identify changes in context.
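The pager's merge step can be sketched roughly as follows. This is an illustrative reconstruction, not the LLM386 API: function and variable names are assumptions, and min-max normalization plus greedy packing stand in for whatever the runtime actually does.

```python
# Hypothetical pager merge: each retriever (recency, BM25, ANN, ...)
# returns raw scores per block id; scores are min-max normalized per
# retriever, summed, and blocks are packed greedily into a section's
# token budget in a deterministic order.

def normalize(scores: dict[str, float]) -> dict[str, float]:
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {bid: (s - lo) / span for bid, s in scores.items()}

def merge(retriever_scores: list[dict[str, float]]) -> dict[str, float]:
    merged: dict[str, float] = {}
    for scores in retriever_scores:
        for bid, s in normalize(scores).items():
            merged[bid] = merged.get(bid, 0.0) + s
    return merged

def pack_section(merged: dict[str, float],
                 block_tokens: dict[str, int],
                 budget: int) -> list[str]:
    chosen, used = [], 0
    # Deterministic order: score descending, block id as tie-breaker.
    for bid in sorted(merged, key=lambda b: (-merged[b], b)):
        if used + block_tokens[bid] <= budget:
            chosen.append(bid)
            used += block_tokens[bid]
    return chosen

recency = {"b1": 3.0, "b2": 1.0}
bm25 = {"b2": 9.0, "b3": 4.0}
merged = merge([recency, bm25])
print(pack_section(merged, {"b1": 200, "b2": 150, "b3": 300}, budget=400))
```

The fixed tie-break on block id is what keeps selection reproducible: given the same store and scores, the same blocks land in the same sections every time.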
All components are deterministic, ensuring prompts can be replayed exactly as seen by the model. The runtime avoids learned rerankers or embedding tweaks, as these would break replayability.
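Byte-level hashing is the simple mechanism that makes replay checkable. A minimal sketch, assuming SHA-256 over the packed prompt's UTF-8 bytes (the tracer's actual format is not specified in this post):

```python
# Hash the exact bytes the model saw: two runs can then be compared
# without storing every prompt verbatim, and any one-byte drift in the
# packed prompt shows up as a different hash.
import hashlib

def prompt_hash(prompt: str) -> str:
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

packed = "System: ...\nTask: ...\nRecent: ..."
h1 = prompt_hash(packed)
assert h1 == prompt_hash(packed)            # deterministic packing => same hash
assert h1 != prompt_hash(packed + " ")      # any byte change is detected
print(h1[:12])
```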
What’s excluded (by design)
LLM386 deliberately avoids features that could introduce unpredictability:
- No chatbot UI.
- No hidden state within prompts.
- No treating model output as ground truth.
- No distributed storage in the initial version.
- No learned components in the hot path.
These omissions reflect the runtime’s focus on clarity and control, making it ideal for debugging and reasoning about agent behavior.
Getting started with LLM386
The project offers a Rust library, Python SDK (via PyO3), and CLI, all licensed under Apache-2.0. An alpha release (1.0.0-alpha) is available, with no immediate plans for a UI or distributed storage.
To test LLM386, clone the repository and run a sample agent:
```shell
git clone
cd llm386
export ANTHROPIC_API_KEY=sk-ant-...
docker compose -f examples/langgraph-agent/docker-compose.yml run --rm agent
```

Within five minutes, you can interact with a small chatbot featuring two stub tools: a calculator and a fake user-profile lookup. Conversations persist across container restarts because the store lives in a Docker volume, and the model recalls prior turns based entirely on the runtime's memory management.
Who should use LLM386?
This runtime is built for developers who need more than ad-hoc prompt assembly:
- Agent developers: If your prompts are unwieldy and you can’t trace what the model is seeing, LLM386 provides structure and clarity.
- Model swappers: The `ModelProfile` abstraction handles context windows, tokenizers, and capability flags, so you can change models without rewriting prompt logic.
- Debuggers: The tracer and diff layer make it easier to diagnose why a model behaves a certain way.
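To make the model-swapping point concrete, here is a sketch of what a `ModelProfile`-style abstraction can look like. The field names and `input_budget` helper are assumptions for illustration, not the real LLM386 API:

```python
# A profile bundles the model-specific facts the pager needs: total
# window, tokenizer, capability flags, and tokens reserved for output.
# Swapping models becomes a data change, not a prompt-logic rewrite.
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class ModelProfile:
    name: str
    context_tokens: int                 # total context window
    count_tokens: Callable[[str], int]  # model-specific tokenizer
    supports_tools: bool = True         # capability flag
    reserved_output: int = 1024         # tokens held back for the reply

    def input_budget(self) -> int:
        return self.context_tokens - self.reserved_output

# Illustrative profiles with a whitespace tokenizer standing in for real ones.
small = ModelProfile("small-32k", 32_000, lambda s: len(s.split()))
large = ModelProfile("large-128k", 128_000, lambda s: len(s.split()))
print(small.input_budget(), large.input_budget())  # 30976 126976
```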
For quick chatbot demos, simpler solutions may suffice. But as agents scale, the need for a reliable, transparent context manager becomes unavoidable. LLM386 offers an approach that balances efficiency with reproducibility—a throwback to the 1990s, but with modern relevance.
AI summary
LLM386 is a runtime, inspired by EMM386's memory management from the 1990s, that optimizes LLM context windows. How it works, who should use it, and how to try it in five minutes.