AI agents often hit a wall when they can’t remember earlier steps in a task—whether a coding assistant loses track of a debugging thread or a data analysis agent re-reads the same context repeatedly. These failures slow workflows, inflate costs, and introduce brittleness. Teams usually respond by expanding context windows or layering in retrieval-augmented generation (RAG) systems, but those fixes grow expensive and don’t reliably deliver long-term continuity.
A new approach called delta-mem, developed by researchers at the Mind Lab and several universities, compresses historical data into a compact, dynamically updated matrix without changing the underlying language model. It adds only 0.12% of the backbone model’s parameters—far below the 76.40% required by a leading alternative—and outperforms it on memory-intensive benchmarks.
Why conventional memory falls short
Most systems treat memory as a context-management problem: either they keep stuffing more information into the model’s context window or they offload retrieval to external RAG modules. Jingdi Lei, a co-author of the paper, noted that both strategies miss the mark for long-running, multi-step interactions. “These methods work up to a point, but they’re expensive and brittle when agents need to operate across extended sessions,” Lei told VentureBeat. “They don’t behave like human memory—they’re more like document lookup than recall.”
Enterprise teams face a dual challenge: not just storing history, but reusing it efficiently, continuously, and with low latency. Standard attention mechanisms scale quadratically with sequence length, and widening the context window doesn’t guarantee the model will actually retain or apply the information. Over time, models suffer from “context rot,” where overlapping or conflicting details degrade performance even if the system nominally supports million-token windows.
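To make the quadratic claim concrete, the short Python sketch below counts the pairwise query-key scores one attention head computes in one layer, and sets that beside the 64 entries of an 8x8 state, the size the researchers report using. The function and constant names are illustrative, not from the paper. The count ignores causal masking, which roughly halves the number but leaves the growth quadratic.

```python
def attention_score_count(seq_len: int) -> int:
    """Query-key scores computed by one full self-attention head in one layer."""
    return seq_len * seq_len


# An 8x8 memory matrix holds 64 numbers no matter how long the history is.
FIXED_STATE_ENTRIES = 8 * 8

if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000, 1_000_000):
        scores = attention_score_count(n)
        print(f"{n:>9,} tokens: {scores:.0e} scores vs {FIXED_STATE_ENTRIES} state entries")
```

The comparison is only about growth rate: attention scores and a memory state play different roles, so the numbers are not a like-for-like cost estimate.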
Existing solutions tend to fall into three categories, each with trade-offs:
- Textual memory: stores history as plain text injected into the prompt. It’s constrained by window limits and loses information when compressed.
- External retrieval (RAG): stores history in an outside module and retrieves it on demand. It adds latency, complicates integration, and can misalign with the core model.
- Parametric memory: bakes memory into model weights via adapters. It’s static after training and can’t adapt to new data during live interactions.
How delta-mem works
Delta-mem introduces an “online state of associative memory” (OSAM), a fixed-size matrix that stores compact representations of past interactions while the language model itself remains unchanged. For enterprise workflows, the practical benefit is continuity within a session. Lei pointed to a persistent coding assistant that needs to recall project conventions, recent debugging steps, user preferences, or intermediate decisions across a session. A data analysis agent, similarly, must maintain task state, assumptions, and prior observations while iterating over multiple tool calls.
Instead of repeatedly re-inserting full context each turn, delta-mem carries forward only the essential interaction states through the model’s forward pass. During generation, the system projects the model’s current hidden state into the memory matrix to retrieve relevant signals. These signals are converted into numerical corrections that nudge the model’s reasoning at inference time without altering its internal weights.
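The read step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `read_memory`, `w_query`, and `w_out`, the hidden size of 16, and the choice to add the correction directly to the hidden state are all hypothetical. Only the fixed-size memory matrix and the idea of projecting the hidden state into it come from the description.

```python
import random

D_MODEL = 16  # hypothetical hidden size; real backbones are much wider
D_MEM = 8     # side length of the memory matrix (8x8, as in the reported setup)


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random projection, standing in for trained weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def read_memory(memory, hidden, w_query, w_out):
    """Return the hidden state plus a correction read from memory.

    The backbone's own weights are never modified here; only the small
    projections w_query and w_out (assumed to be the added parameters)
    sit beside it.
    """
    query = matvec(w_query, hidden)      # project hidden state into memory space
    signal = matvec(memory, query)       # associative read: memory @ query
    correction = matvec(w_out, signal)   # map the signal back to model width
    return [h + c for h, c in zip(hidden, correction)]


if __name__ == "__main__":
    rng = random.Random(0)
    w_query = random_matrix(D_MEM, D_MODEL, rng)
    w_out = random_matrix(D_MODEL, D_MEM, rng)
    hidden = [rng.uniform(-1, 1) for _ in range(D_MODEL)]
    empty = [[0.0] * D_MEM for _ in range(D_MEM)]
    # With nothing stored, the read leaves the hidden state untouched.
    print(read_memory(empty, hidden, w_query, w_out) == hidden)
```

One property worth noting in this sketch: an empty memory produces a zero correction, so the model behaves exactly like the unmodified backbone until something has been written.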
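After each interaction, the memory matrix is updated with a “delta rule”: rather than simply appending new information, the system writes only the difference between what the memory already predicts and what it just observed. A gate controls how much prior memory to retain and how much new memory to incorporate, which lets the system correct stale or wrong entries without overreacting to short-term noise. The team tested three update strategies: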
- Token-state write: captures fine-grained changes but can be disrupted by transient noise.
- Sequence-state write: smooths updates by averaging tokens within a message segment, trading some detail for stability.
- Multi-state write: splits memory into sub-states for different information types, such as facts or task progress.
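The update and the three write strategies can be sketched as follows. The code uses one common formulation of a gated delta rule, S <- alpha*S + beta*(v - alpha*S k) k^T, which may differ from the paper's exact equations. The gates `alpha` and `beta` are passed in as plain numbers here, although in a real system they would presumably be computed by the model. All function names are hypothetical, and the routing of segments to sub-states in `multi_state_write` is an assumption, since the description does not say how information types are separated.

```python
def gated_delta_update(memory, key, value, alpha, beta):
    """One gated delta-rule write: S <- alpha*S + beta*(value - alpha*S@key) key^T.

    alpha in [0, 1] decays old memory; beta in [0, 1] scales the new write.
    """
    # What the (decayed) memory currently returns for this key.
    predicted = [alpha * sum(s * k for s, k in zip(row, key)) for row in memory]
    # Only the prediction error is written back.
    error = [v - p for v, p in zip(value, predicted)]
    return [[alpha * s + beta * e * k for s, k in zip(row, key)]
            for row, e in zip(memory, error)]


def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def token_state_write(memory, keys, values, alpha, beta):
    """Write once per token: fine-grained, but every noisy token leaves a mark."""
    for key, value in zip(keys, values):
        memory = gated_delta_update(memory, key, value, alpha, beta)
    return memory


def sequence_state_write(memory, keys, values, alpha, beta):
    """Write once per segment, using token averages: smoother, less detailed."""
    return gated_delta_update(memory, mean_vector(keys), mean_vector(values), alpha, beta)


def multi_state_write(memories, segments, alpha, beta):
    """Keep one matrix per information type (assumed routing by name).

    memories: {name: matrix}; segments: {name: (keys, values)}.
    Sub-states with no new segment are carried forward unchanged.
    """
    return {
        name: (sequence_state_write(memories[name], *segments[name], alpha, beta)
               if name in segments else memories[name])
        for name in memories
    }


if __name__ == "__main__":
    key = [1.0, 0.0]
    state = [[0.0, 0.0], [0.0, 0.0]]
    state = gated_delta_update(state, key, [5.0, 1.0], alpha=1.0, beta=1.0)
    # Writing a new value under the same key overwrites the old association.
    state = gated_delta_update(state, key, [7.0, 2.0], alpha=1.0, beta=1.0)
    print([sum(s * k for s, k in zip(row, key)) for row in state])
```

With both gates at 1.0 and a unit-length key, the second write fully replaces the first, which is the error-correcting behavior the delta rule is meant to provide. Lower values of `beta` make each write more tentative, and lower values of `alpha` let old associations fade.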
Performance on real-world tasks
The researchers evaluated delta-mem across three language model backbones: Qwen3-8B, Qwen3-4B-Instruct, and SmolLM3-3B. They configured the system with an 8x8 memory matrix and tested it on general benchmarks like HotpotQA, GPQA-Diamond, and IFEval. They also ran it on memory-heavy tasks such as LoCoMo, which measures long-term conversational memory, and the Memory Agent Bench, which assesses an agent’s ability to track state across extended interactions.
The results show delta-mem outperforming a leading alternative in memory-intensive scenarios while adding 0.12% of the backbone’s parameters, compared with 76.40% for that alternative. Because the memory matrix is small and fixed in size, computational overhead stays low, which makes the approach practical for latency-sensitive deployments.
The road ahead
Delta-mem points to a future where AI agents retain continuity without ballooning costs or complexity. By embedding a lightweight, trainable memory layer inside the inference pipeline, systems can carry forward relevant history without repeatedly re-reading or re-parsing data. As models move toward multi-step, tool-integrated workflows, techniques like delta-mem could become a standard way to sustain coherent, low-latency behavior over long sessions.
AI summary
An innovative method called delta-mem gives AI agents long-term memory. The system adds only 0.12% in extra parameters and could replace RAG and expanded context windows.


