
Why AI models silently rewrite your documents — and how to catch the errors

New research reveals that even cutting-edge AI models introduce critical errors in documents when performing multi-step tasks. Discover how these failures evade detection and what they mean for automated workflows.


Enterprise teams are increasingly relying on large language models to handle knowledge-intensive tasks, from reorganizing financial ledgers to editing legal contracts. The promise is clear: delegate tedious document processing and let AI deliver polished results. But a recent Microsoft study suggests this trust may be misplaced. Researchers found that even the most advanced AI models silently corrupt document content during multi-step workflows, often in ways that are nearly impossible to detect.

The hidden cost of delegated AI workflows

The Microsoft research team introduced the concept of "delegated work" — scenarios where users entrust AI systems to autonomously process, edit, or restructure documents across multiple interactions. Unlike traditional one-off prompts, these workflows demand sustained accuracy over extended sessions. Common applications include splitting accounting files into categories, condensing research reports, or extracting structured data from unstructured text.

For professionals pressed for time, delegation is attractive because it shifts repetitive cognitive load to AI. However, the study raises a critical question: How reliable are these models when operating beyond single-turn interactions? To answer this, the researchers developed DELEGATE-52, a benchmark comprising 310 simulated professional environments spanning 52 domains, including software development, financial analysis, and creative writing.

How the benchmark exposes AI’s blind spots

Evaluating multi-step document editing typically requires labor-intensive human review, which is expensive and subjective. Instead, the team designed a "round-trip relay" method inspired by backtranslation in machine translation. Each editing task includes a reversible instruction — for example, splitting a ledger into expense categories followed by merging them back — forcing the model to perform both operations independently.

This approach isolates content degradation by measuring how much information is lost or distorted during the round trip. The models tested — including variants from OpenAI, Anthropic, Google, and others — were unaware they were participating in an inverse task, simulating the unpredictability of real-world usage.
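
As an illustration of how such a relay can be scored automatically, the sketch below compares a document before and after a hypothetical split-then-merge cycle. The model_split and model_merge callables are assumptions standing in for any forward/inverse instruction pair, and the sequence-similarity ratio is a stand-in for whatever metric the paper actually uses:

    from difflib import SequenceMatcher

    def round_trip_degradation(original, model_split, model_merge):
        """Run a reversible edit pair and score how much content survives."""
        # Forward pass, e.g. "split this ledger into expense categories"
        parts = model_split(original)
        # Inverse pass, issued without access to the forward conversation,
        # mirroring the study's independent-session setup
        reconstructed = model_merge(parts)
        # Similarity in [0, 1]; degradation is what was lost in transit
        similarity = SequenceMatcher(None, original, reconstructed).ratio()
        return 1.0 - similarity

A score near zero means the round trip preserved the document; anything substantially above zero signals silent corruption.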

Distractor documents, containing 8,000 to 12,000 tokens of irrelevant but thematically similar text, were also introduced to test focus. Researchers wanted to see whether models could stay anchored to the target document or whether they would fold the extraneous material into their edits. The results were troubling.

Frontier models fail under sustained workloads

After simulating 20 consecutive editing interactions, the study found that all tested models introduced significant errors. The average document degradation across all models reached 50%, with even the top-tier models — including Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 — corrupting around 25% of content. Only Python-related tasks consistently achieved high accuracy, with scores above 98%, while domains like fiction writing, legal statements, and recipe editing showed severe degradation.

The failures were not gradual. Approximately 80% of total corruption stemmed from sparse but catastrophic events, where a single interaction caused a sudden loss of at least 10% of document content. This pattern suggests that while models may appear stable in short bursts, their reliability collapses under sustained, real-world conditions.
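
One practical way to surface these cliff-edge failures is to snapshot the document after every interaction and flag any single step that erases a large share of it. A minimal sketch, assuming snapshots are captured at each step and using raw character counts as a crude proxy for content:

    def flag_catastrophic_steps(snapshots, threshold=0.10):
        """Return the interaction indices where at least `threshold`
        of the document's content vanished in a single step."""
        flagged = []
        for i in range(1, len(snapshots)):
            prev_len = len(snapshots[i - 1])
            if prev_len == 0:
                continue
            loss = (prev_len - len(snapshots[i])) / prev_len
            if loss >= threshold:
                flagged.append(i)
        return flagged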

Philippe Laban, Senior Researcher at Microsoft and co-author of the paper, emphasized that the benchmark reflects genuine workflow challenges: "These models are not designed to remember previous steps or maintain long-term coherence. When you chain 20 interactions without human oversight, even small missteps compound into irreversible damage."

What this means for automated knowledge work

The findings underscore a growing paradox in AI adoption: the more advanced models become, the more we rely on them for complex tasks — yet their reliability diminishes as workflows deepen. The study authors caution that current systems lack the guardrails to prevent silent corruption in delegated environments, especially when distractor data or multi-step instructions are involved.

For organizations considering AI-driven automation, the implications are clear. Blind trust in model outputs is risky. Implementing verification layers — such as automated round-trip validation, human-in-the-loop reviews, or checksum-based integrity checks — may be necessary to prevent costly errors. Until models demonstrate sustained fidelity across long-horizon tasks, delegation should be approached with robust oversight and clear fallback mechanisms.
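
As one example of such a verification layer, a checksum-based check can pin the passages a delegated workflow is not supposed to touch, so any silent rewrite fails loudly. A minimal sketch, assuming the caller can enumerate the sections that must remain verbatim:

    import hashlib

    def fingerprint(sections):
        """Record a SHA-256 digest for each section that must not change."""
        return {name: hashlib.sha256(text.encode()).hexdigest()
                for name, text in sections.items()}

    def altered_sections(sections, expected):
        """Return the names of protected sections the model has modified."""
        current = fingerprint(sections)
        return [name for name, digest in expected.items()
                if current.get(name) != digest]

Fingerprints taken before delegation and re-checked afterward turn an invisible edit into an explicit diff that a human can review.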

AI summary

Microsoft research found that leading AI models rewrite documents, causing an average content loss of 25%. Risks to watch for when automating.
