AI agents corrupt documents in more than 80 percent of the model and domain combinations tested, according to new research from Microsoft Research released as a preprint in April 2026. The study tested frontier models including Claude 4.6 Opus, GPT 5.4, and Gemini 3.1 Pro across 52 professional domains and found an average 25 percent content loss across extended multi-step sessions. More surprising: equipping agents with file-reading and writing tools made the problem 6 percent worse.
If you use Claude Code, Codex, or any AI agent to automate file organization, document editing, or data processing in your creative workflow, this research is a concrete warning with specific numbers attached.
What Is DELEGATE-52
Researchers Philippe Laban, Tobias Schnabel, and Jennifer Neville from Microsoft Research designed DELEGATE-52 to simulate real-world delegated workflows across 52 professional domains. DELEGATE stands for Delegated Evaluation of LLM Agents Testing Evolving document quality.
Unlike benchmarks that test isolated single-step tasks, DELEGATE-52 chains multiple operations together the way a real workflow actually runs. A representative test: take an accounting ledger, split it by expense category into separate files, then merge the categories back chronologically. The models are scored on how much of the original document content survives 20 delegated interactions. The passing threshold is 98 percent or higher.
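To make the ledger round trip concrete, here is a minimal sketch of the same split-and-merge done deterministically, with a simple preservation score. The row layout, the function names, and the scoring rule (share of original rows that survive) are assumptions for illustration; this is not the paper's metric or harness.

```python
from collections import defaultdict

# Hypothetical ledger rows: (ISO date, category, amount). Not data from the paper.
LEDGER = [
    ("2026-01-03", "travel", 120.00),
    ("2026-01-05", "software", 49.99),
    ("2026-01-09", "travel", 310.50),
    ("2026-01-12", "office", 18.25),
]

def split_by_category(rows):
    """Group rows into one list per expense category."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[1]].append(row)
    return dict(groups)

def merge_chronologically(groups):
    """Flatten the per-category lists back into one date-ordered ledger."""
    merged = [row for rows in groups.values() for row in rows]
    return sorted(merged, key=lambda row: row[0])  # ISO dates sort as strings

def preservation_score(original, result):
    """Share of original rows still present in the result (1.0 = nothing lost)."""
    remaining = list(result)
    kept = 0
    for row in original:
        if row in remaining:
            remaining.remove(row)  # match each surviving row only once
            kept += 1
    return kept / len(original) if original else 1.0

round_trip = merge_chronologically(split_by_category(LEDGER))
score = preservation_score(LEDGER, round_trip)
lossy_score = preservation_score(LEDGER, round_trip[:-1])  # simulate one dropped row
```

A deterministic round trip scores 1.0 by construction. The benchmark asks whether a model performing the same operations stays at or above 0.98.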
The 52 domains include areas that map directly to creative and technical workflows: code writing, music notation, crystallography data, financial accounting, and natural language documents. The full paper is available at arxiv.org/abs/2604.15597 and is titled "LLMs Corrupt Your Documents When You Delegate."
What the Models Actually Did
The results across all model and domain combinations show fewer than 20 percent of combinations met the 98 percent threshold. Frontier models averaged 25 percent content loss over 20 interactions. Across all tested models including older versions, the average loss climbed to 50 percent.
| Model | Domains Passed (of 52) | Notes |
|---|---|---|
| Gemini 3.1 Pro | 11 | Best performer overall |
| Claude 4.6 Opus | Low | Similar failure rate in non-programming domains |
| GPT 5.4 | Low | Consistent degradation across domains |
| GPT 5.2, 5.1, 4.1 | Low | All showed the same pattern |
Only one domain passed reliably across models: Python programming. No other domain cleared the threshold consistently, and even the best performer, Gemini 3.1 Pro, passed just 11 of the 52. The researchers note this holds because code has formal syntax rules that make errors detectable. Unstructured content, including natural language documents and creative files, has no equivalent safety net.
Failures were not gradual. The paper describes sudden catastrophic drops: models would handle 15 interactions correctly, then lose 20 to 30 points in a single step. This failure pattern is particularly dangerous because your workflow appears healthy until it is not.
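If you log a preservation score after each interaction, a cliff like this is easy to flag automatically. The sketch below assumes you already have such a score per step; the function name, the 5-point alert threshold, and the sample history are all hypothetical.

```python
def find_sudden_drops(scores, max_drop=5.0):
    """Return (index, drop) pairs where the score fell by more than max_drop points.

    `index` is the position in `scores` of the reading that came after the drop.
    """
    drops = []
    for index in range(1, len(scores)):
        drop = scores[index - 1] - scores[index]
        if drop > max_drop:
            drops.append((index, drop))
    return drops

# Hypothetical per-interaction preservation scores in percent, not data from the paper.
history = [100, 100, 99, 99, 99, 98, 98, 72, 71]
alerts = find_sudden_drops(history)
```

With this history, the only alert is the 26-point fall at index 7. The slow one-point slips before it stay under the threshold.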

The Tool-Using Paradox
The finding that will surprise most workflow builders: giving agents tools made performance worse. When researchers equipped models with file reading, writing, and code execution capabilities (the standard setup for agentic workflows), average performance degraded by an additional 6 percent compared to models without tool access.
The premise of AI agents is that tools extend capability. In document manipulation tasks, tool access appears to amplify the core problem: models lose track of what content existed, overwrite data while reorganizing, and fail to verify their own outputs before moving to the next step.

Why This Matters for Creative AI Workflows
Most coverage of AI agents focuses on speed and capability gains. This research raises a different question: what are you risking when you delegate file operations?
For creators, the at-risk workflows include project file organization (asking an agent to sort or rename creative assets), script and document editing across multiple revision passes, data extraction and reformatting between file types, batch content generation where the agent writes many files in sequence, and codebase refactoring (the one domain where AI performed adequately, but only barely).
The scale of adoption makes this urgent. Tools like Claude Code and OpenAI Codex are now in the hands of hundreds of thousands of developers and creative professionals running automated workflows daily. Many run these workflows unmonitored. The DELEGATE-52 data says that level of trust is not yet supported by model performance for most document types.
How to Build Safer AI Agent Workflows
The researchers' conclusion is direct: "users still need to closely monitor LLM systems as they operate and complete tasks on their behalf." Here is what that looks like for a practical creative workflow.
Step 1: Add a verification checkpoint after every major operation. Do not chain 10 steps and review the output at the end. After each significant file operation, inspect what the agent produced before authorizing the next step. Catastrophic drops in DELEGATE-52 happened mid-workflow, not only at the end.
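One way to automate part of that inspection is a line-level checkpoint, sketched below. It assumes the operation is meant to preserve existing lines (a sort, split, or merge), so it does not suit steps that rewrite lines on purpose. The function names and the zero-tolerance default are illustrative choices, not a method from the paper.

```python
from collections import Counter

def missing_lines(before: str, after: str) -> list[str]:
    """Lines that were in `before` but not in `after`, ignoring order and blank lines."""
    want = Counter(line.strip() for line in before.splitlines() if line.strip())
    have = Counter(line.strip() for line in after.splitlines() if line.strip())
    return sorted((want - have).elements())

def checkpoint(before: str, after: str, allowed_missing: int = 0) -> bool:
    """Return True when the next step may run; False means stop and review."""
    return len(missing_lines(before, after)) <= allowed_missing

before = "intro\nscene one\nscene two\noutro\n"
after = "intro\nscene two\noutro\n"
ok = checkpoint(before, after)
lost = missing_lines(before, after)
```

Here `ok` is false and `lost` names the dropped line, which is the signal to stop before authorizing the next operation.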
Step 2: Keep originals before any agent touches a file. Use git version control for text files. For creative assets, maintain a pre-agent snapshot in a separate folder. Many automation setups skip this in the name of speed.
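For assets that do not live in git, a timestamped copy is enough. This is a minimal sketch using the standard library; the function name and folder naming scheme are assumptions, and the snapshot folder should sit outside the project directory so the copy does not include itself.

```python
import shutil
import time
from pathlib import Path

def snapshot(project_dir: str, snapshot_root: str) -> Path:
    """Copy project_dir into a new timestamped folder under snapshot_root."""
    source = Path(project_dir).resolve()
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(snapshot_root) / f"{source.name}-{stamp}"
    shutil.copytree(source, target)  # raises if the target folder already exists
    return target
```

Calling `snapshot("my_project", "../snapshots")` before each agent session gives you a known-good copy to diff against or restore from. Both paths here are placeholders.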
Step 3: Test on throwaway copies first. Before deploying a new AI workflow on real project files, run the same workflow on dummy data in the same format. Check output at every step. Deploy on real work only after confirming it.
Step 4: Prefer structured formats for delegated tasks. The domain where AI passed was Python code, because code syntax makes errors visible. Where possible, represent your data in structured formats (JSON, CSV, YAML) rather than freeform text. Errors in structured data are harder to hide.
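Structured output can be checked mechanically before you accept it. The sketch below validates agent-produced JSON against an expected record count and a set of required keys; the schema (`id`, `title`, `duration_s`) and the function name are hypothetical stand-ins for whatever your data actually contains.

```python
import json

REQUIRED_KEYS = {"id", "title", "duration_s"}  # hypothetical schema, adjust to your data

def validate_records(raw: str, expected_count: int) -> list[str]:
    """Return a list of problems found in agent-produced JSON; empty means it passed."""
    try:
        records = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(records, list):
        return ["top-level value is not a list"]
    problems = []
    if len(records) != expected_count:
        problems.append(f"expected {expected_count} records, found {len(records)}")
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append(f"record {index} is not an object")
            continue
        missing = REQUIRED_KEYS - set(record)
        if missing:
            problems.append(f"record {index} missing keys: {sorted(missing)}")
    return problems

good = '[{"id": 1, "title": "Intro", "duration_s": 42}]'
truncated = '[{"id": 1, "title": "Intro"}]'
```

A check like this catches dropped records and dropped fields. It cannot tell whether a surviving value was silently altered, so it complements review and does not replace it.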
Step 5: Treat file-access tools with caution. Since tool access worsened performance by 6 percent, consider whether file write access is necessary at each step. Some workflows can be redesigned so the AI generates output that a human or deterministic script writes to disk, rather than giving the agent direct write access.
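A sketch of that redesign: the agent returns proposed text, and a small deterministic function decides whether it reaches disk. The function name is hypothetical, and the 0.98 default simply borrows the benchmark's pass threshold as a starting point. The line-overlap check assumes edits that add or reorder lines, so it would reject a legitimate rewrite.

```python
import os
import tempfile
from pathlib import Path

def guarded_write(path: str, proposed: str, min_ratio: float = 0.98) -> bool:
    """Write `proposed` to `path` only if it keeps at least min_ratio of the old lines."""
    target = Path(path)
    if target.exists():
        old_text = target.read_text(encoding="utf-8")
        old_lines = {line.strip() for line in old_text.splitlines() if line.strip()}
        new_lines = {line.strip() for line in proposed.splitlines() if line.strip()}
        if old_lines and len(old_lines & new_lines) / len(old_lines) < min_ratio:
            return False  # refuse: too much existing content would disappear
    # Write to a temp file in the same directory, then swap it in atomically.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(proposed)
    os.replace(tmp_name, target)
    return True
```

When the function returns `False`, the file on disk is untouched and the proposed text can go to a human for review.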
Step 6: Break long workflows into shorter sub-tasks. The DELEGATE-52 study ran 20 interactions. Losses accumulated as sessions ran longer, and the largest drops arrived without warning. For workflows that must run long, build in human handoffs rather than one continuous 20-step chain.
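The handoff pattern can be sketched as a batch runner that pauses for approval between groups of steps. Everything here is illustrative: the steps are plain string functions standing in for agent calls, and `approve` is a stand-in for a human review prompt.

```python
from typing import Callable, Sequence

def run_in_batches(
    steps: Sequence[Callable[[str], str]],
    document: str,
    batch_size: int,
    approve: Callable[[str, str], bool],
) -> tuple[str, int]:
    """Apply steps in batches, calling approve(before, after) after each batch.

    Returns the last approved document and the number of steps accepted.
    """
    accepted = 0
    for start in range(0, len(steps), batch_size):
        batch = steps[start:start + batch_size]
        candidate = document
        for step in batch:
            candidate = step(candidate)
        if not approve(document, candidate):
            break  # keep the last approved version and stop
        document = candidate
        accepted += len(batch)
    return document, accepted

# Stand-in steps; the third one truncates the text to simulate a corrupting step.
steps = [str.upper, str.strip, lambda text: text[:3], str.lower]
# Stand-in reviewer: reject any batch that shrinks the document by more than 10 percent.
approve = lambda before, after: len(after) >= 0.9 * len(before)
final, accepted = run_in_batches(steps, "hello world", batch_size=2, approve=approve)
```

The second batch is rejected, so the run stops with the output of the first batch intact instead of carrying the truncation forward.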

What This Research Does Not Mean
This research does not say AI agents are useless. Python code manipulation passed. Shorter, contained tasks that do not involve multi-step document manipulation are outside the scope of this study. The researchers measured delegated workflows (tasks where you hand off and walk away), not supervised collaboration where a developer reviews each action.
Supervised use of Claude Code and similar tools, where a developer reviews each output, is a different risk profile from fully automated pipelines. The DELEGATE-52 warning applies most directly to automation that runs without human review at each step.
These models will improve. The benchmark exists to drive that improvement. But as of May 2026, with models currently in production, the data is clear: do not trust unmonitored multi-step file operations on content you cannot afford to corrupt.
Frequently Asked Questions
What is DELEGATE-52?
DELEGATE-52 is a benchmark created by Microsoft Research to evaluate how well AI models handle multi-step document manipulation tasks across 52 professional domains. It measures how much content AI agents preserve or lose through extended workflows of up to 20 interactions.
Which AI models were tested in DELEGATE-52?
The study tested Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4, GPT 5.2, GPT 5.1, and GPT 4.1. The first three are the frontier models named in the study; the list also includes older versions.
How severe is the content loss?
Frontier models lost an average of 25 percent of document content over 20 interactions. Across all tested models, the average loss was 50 percent. Only Python programming consistently met the 98 percent preservation threshold.
Does giving AI agents tool access help document preservation?
No. Giving models file read, write, and code execution tools degraded performance by an additional 6 percent on average compared to models without tool access.
Is this a problem specific to one AI provider?
No. The same degradation pattern appeared across models from Google, OpenAI, and Anthropic. This is a general LLM limitation in extended delegation tasks, not specific to any single provider.
Where is the full research paper?
The preprint is at arxiv.org/abs/2604.15597, titled "LLMs Corrupt Your Documents When You Delegate." It was published in April 2026 and reported by The Register on May 11.
Should I stop using AI agents for creative file tasks entirely?
For Python code manipulation, current models are adequate with proper oversight. For other document types, add human checkpoints at each major step, keep backups, and avoid fully unmonitored automation on irreplaceable files.