AI Agents Corrupt Files in 80% of Workflows

AI agents corrupt documents in more than 80 percent of tested model and domain combinations, according to new research from Microsoft Research released as a preprint in April 2026. The study tested frontier models including Claude 4.6 Opus, GPT 5.4, and Gemini 3.1 Pro across 52 professional domains and found an average 25 percent content loss over extended multi-step sessions. More surprising: equipping agents with file-reading and file-writing tools made the problem 6 percent worse.

If you use Claude Code, Codex, or any AI agent to automate file organization, document editing, or data processing in your creative workflow, this research is a concrete warning with specific numbers attached.

What Is DELEGATE-52

Researchers Philippe Laban, Tobias Schnabel, and Jennifer Neville from Microsoft Research designed DELEGATE-52 to simulate real-world delegated workflows across 52 professional domains. DELEGATE stands for Delegated Evaluation of LLM Agents Testing Evolving document quality.

Unlike benchmarks that test isolated single-step tasks, DELEGATE-52 chains multiple operations together the way a real workflow actually runs. A representative test: take an accounting ledger, split it by expense category into separate files, then merge the categories back chronologically. The models are scored on how much of the original document content survives 20 delegated interactions. The passing threshold is 98 percent or higher.
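To make the round-trip idea concrete, here is a minimal sketch of the ledger example done deterministically, with a simple preservation score. The row layout, the function names, and the scoring rule (share of original rows that survive) are assumptions for illustration only; the benchmark's actual scoring method may differ.

```python
from collections import Counter, defaultdict

# Hypothetical ledger rows: (date, category, amount). Not the benchmark's real format.
LEDGER = [
    ("2026-01-03", "travel", 120.00),
    ("2026-01-05", "software", 49.99),
    ("2026-01-09", "travel", 310.50),
    ("2026-01-12", "office", 18.25),
]

def split_by_category(rows):
    """Group rows into one list per expense category."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[1]].append(row)
    return dict(groups)

def merge_chronologically(groups):
    """Flatten the per-category lists and sort by ISO date string."""
    merged = [row for rows in groups.values() for row in rows]
    return sorted(merged, key=lambda row: row[0])

def preservation(original, result):
    """Fraction of original rows still present, counting duplicates."""
    if not original:
        return 1.0
    surviving = Counter(original) & Counter(result)
    return sum(surviving.values()) / len(original)

round_trip = merge_chronologically(split_by_category(LEDGER))
score = preservation(LEDGER, round_trip)
print(f"preserved: {score:.0%}")
```

A deterministic script gets this round trip right every time. The benchmark asks whether a model does the same when the operations are delegated to it.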

The 52 domains include areas that map directly to creative and technical workflows: code writing, music notation, crystallography data, financial accounting, and natural language documents. The full paper is available at arxiv.org/abs/2604.15597 and is titled "LLMs Corrupt Your Documents When You Delegate."

What the Models Actually Did

Across all model and domain combinations, fewer than 20 percent met the 98 percent threshold. Frontier models averaged 25 percent content loss over 20 interactions. Across all tested models, including older versions, the average loss climbed to 50 percent.
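For a sense of scale, the 25 percent figure can be converted into a per-step rate. The calculation below assumes the loss is spread evenly across all 20 interactions, which is a simplification made purely for illustration and is not how the paper describes the failures. Under that assumption, each step would need to keep roughly 98.6 percent of the content to end up at 75 percent.

```python
# Illustrative arithmetic only: assumes loss is spread evenly across steps.
STEPS = 20
FINAL_RETENTION = 0.75  # 25 percent average loss reported for frontier models

# Solve r ** STEPS == FINAL_RETENTION for the per-step retention r.
per_step = FINAL_RETENTION ** (1 / STEPS)
print(f"per-step retention: {per_step:.4f}")

# Retention after n steps at that uniform rate.
for n in (5, 10, 20):
    print(n, round(per_step ** n, 3))
```

The point of the exercise: a step that looks almost perfect in isolation can still add up to a large loss over a long chain.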

Model	Domains Passed (of 52)	Notes
Gemini 3.1 Pro	11	Best performer overall
Claude 4.6 Opus	Low	Similar failure rate in non-programming domains
GPT 5.4	Low	Consistent degradation across domains
GPT 5.2, 5.1, 4.1	Low	All showed the same pattern

Only one domain passed reliably across models: Python programming. No other domain cleared the threshold consistently. The researchers note that code has formal syntax rules that make errors detectable. Unstructured content, including natural language documents and creative files, has no equivalent safety net.

Failures were not gradual. The paper describes sudden catastrophic drops: models would handle 15 interactions correctly, then lose 20 to 30 points in a single step. This failure pattern is particularly dangerous because your workflow appears healthy until it is not.
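If you log a quality score after each interaction, a sudden drop like this is easy to flag automatically. The sketch below is a minimal example; the score series is made up, and `find_sudden_drops` and its `max_drop` parameter are hypothetical names, not anything from the paper.

```python
def find_sudden_drops(scores, max_drop=0.05):
    """Return (step, drop) pairs where the score fell by more than max_drop.

    scores[i] is a hypothetical preservation score (0.0 to 1.0) measured
    after interaction i + 1.
    """
    drops = []
    previous = 1.0  # the untouched document counts as fully preserved
    for step, score in enumerate(scores, start=1):
        drop = previous - score
        if drop > max_drop:
            drops.append((step, round(drop, 3)))
        previous = score
    return drops

# Made-up trajectory: stable for 15 steps, then one large loss at step 16.
trajectory = [0.99] * 15 + [0.72] + [0.71] * 4
print(find_sudden_drops(trajectory))  # prints [(16, 0.27)]
```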

[Image: AI agents corrupted documents in 80% of tested professional domains.]

The Tool-Using Paradox

The finding that will surprise most workflow builders: giving agents tools made performance worse. When researchers equipped models with file reading, writing, and code execution capabilities (the standard setup for agentic workflows), average performance degraded by an additional 6 percent compared to models without tool access.

The premise of AI agents is that tools extend capability. In document manipulation tasks, tool access appears to amplify the core problem: models lose track of what content existed, overwrite data while reorganizing, and fail to verify their own outputs before moving to the next step.

[Image: More tools made agents less reliable, not more capable.]

Why This Matters for Creative AI Workflows

Most coverage of AI agents focuses on speed and capability gains. This research raises a different question: what are you risking when you delegate file operations?

For creators, the at-risk workflows include project file organization (asking an agent to sort or rename creative assets), script and document editing across multiple revision passes, data extraction and reformatting between file types, batch content generation where the agent writes many files in sequence, and codebase refactoring (the one domain where AI performed adequately, but only barely).

The scale of adoption makes this urgent. Tools like Claude Code and OpenAI Codex are now in the hands of hundreds of thousands of developers and creative professionals running automated workflows daily. Many run these workflows unmonitored. The DELEGATE-52 data says that level of trust is not yet supported by model performance for most document types.

How to Build Safer AI Agent Workflows

The researchers' conclusion is direct: "users still need to closely monitor LLM systems as they operate and complete tasks on their behalf." Here is what that looks like for a practical creative workflow.

Step 1: Add a verification checkpoint after every major operation. Do not chain 10 steps and review the output at the end. After each significant file operation, inspect what the agent produced before authorizing the next step. Catastrophic drops in DELEGATE-52 happened mid-workflow, not only at the end.
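One way to sketch such a checkpoint in code: run each step through a wrapper that compares the output against the input and stops the chain if too much went missing. Everything here is an assumption for illustration, including the `line_retention` measure, which only suits tasks that reorganize content rather than rewrite it. An editing task would need a different check.

```python
class CheckpointFailed(Exception):
    """Raised when a step loses more content than the allowed threshold."""

def line_retention(before: str, after: str) -> float:
    """Share of distinct non-empty lines in `before` that still appear in `after`."""
    original = {line.strip() for line in before.splitlines() if line.strip()}
    if not original:
        return 1.0
    kept = {line.strip() for line in after.splitlines() if line.strip()}
    return len(original & kept) / len(original)

def run_with_checkpoints(document: str, steps, min_retention: float = 0.98) -> str:
    """Apply each step, checking the result before the next step runs."""
    current = document
    for number, step in enumerate(steps, start=1):
        candidate = step(current)
        retention = line_retention(current, candidate)
        if retention < min_retention:
            raise CheckpointFailed(
                f"step {number} kept only {retention:.0%} of lines; stopping"
            )
        current = candidate  # accept the output only after it passes
    return current

# Stand-ins for agent steps: one harmless reorder, one that silently drops lines.
reverse_lines = lambda text: "\n".join(reversed(text.splitlines()))
drop_half = lambda text: "\n".join(text.splitlines()[:2])

try:
    run_with_checkpoints("alpha\nbravo\ncharlie\ndelta", [reverse_lines, drop_half])
except CheckpointFailed as error:
    print(error)
```

The wrapper never hands a failed output to the next step, so a bad result cannot be built on.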

Step 2: Keep originals before any agent touches a file. Use git version control for text files. For creative assets, maintain a pre-agent snapshot in a separate folder. Many automation setups skip this in the name of speed.
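For files that do not live in git, a pre-agent snapshot can be a few lines of standard-library Python. This is a minimal sketch, not a backup system: the `snapshot` function and its folder naming are assumptions, and the backup folder should sit outside the project folder being copied.

```python
import hashlib
import shutil
import time
from pathlib import Path

def snapshot(project_dir, backup_root):
    """Copy project_dir into a timestamped folder and return (path, manifest).

    The manifest maps each relative file path to its SHA-256 digest so later
    changes can be detected by comparing hashes.
    """
    project_dir = Path(project_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    destination = Path(backup_root) / f"{project_dir.name}-{stamp}"
    # copytree refuses to overwrite an existing destination, which is what we want.
    shutil.copytree(project_dir, destination)

    manifest = {}
    for path in sorted(destination.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(destination).as_posix()] = digest
    return destination, manifest
```

Keeping the manifest alongside the copy lets you confirm later, by hash, exactly which files an agent altered.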

Step 3: Test on throwaway copies first. Before deploying a new AI workflow on real project files, run the same workflow on dummy data in the same format. Check output at every step. Deploy on real work only after confirming it.
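A throwaway-copy test can be automated along these lines. The sketch assumes your workflow can be wrapped as a callable that takes a folder path; `dry_run` and `list_files` are hypothetical helper names. It copies the sample folder to a temporary location, runs the workflow there, and reports which files went missing, appeared, or changed.

```python
import filecmp
import shutil
import tempfile
from pathlib import Path

def list_files(root):
    """Relative POSIX-style paths of every file under root."""
    return {p.relative_to(root).as_posix() for p in Path(root).rglob("*") if p.is_file()}

def dry_run(sample_dir, workflow):
    """Run `workflow` on a temporary copy of sample_dir and report the differences.

    `workflow` is a hypothetical callable standing in for an agent run; it
    receives the path of the copy and may modify anything inside it.
    """
    sample_dir = Path(sample_dir)
    with tempfile.TemporaryDirectory() as scratch:
        working_copy = Path(scratch) / "copy"
        shutil.copytree(sample_dir, working_copy)
        workflow(working_copy)

        before, after = list_files(sample_dir), list_files(working_copy)
        changed = [
            name for name in sorted(before & after)
            # shallow=False forces a byte-by-byte comparison of file contents.
            if not filecmp.cmp(sample_dir / name, working_copy / name, shallow=False)
        ]
        return {
            "missing": sorted(before - after),
            "added": sorted(after - before),
            "changed": changed,
        }
```

The sample folder itself is never touched, so the same dummy data can be reused for every trial run.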

Step 4: Prefer structured formats for delegated tasks. The domain where AI passed was Python code, because code syntax makes errors visible. Where possible, represent your data in structured formats (JSON, CSV, YAML) rather than freeform text. Errors in structured data are harder to hide.
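As a small illustration of why structure helps, a CSV file can be checked mechanically for the right header, row count, and field count, something that has no direct equivalent for freeform prose. The function name and the specific checks below are assumptions chosen for the example, not a complete validator.

```python
import csv
import io

def check_csv(text, expected_columns, expected_rows):
    """Return a list of problems found in CSV text; an empty list means it passed."""
    problems = []
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header != list(expected_columns):
        problems.append(f"header mismatch: {header}")
    rows = list(reader)
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    for number, row in enumerate(rows, start=1):
        if len(row) != len(expected_columns):
            problems.append(f"row {number} has {len(row)} fields")
    return problems

columns = ["date", "category", "amount"]
intact = "date,category,amount\n2026-01-03,travel,120.00\n2026-01-05,software,49.99\n"
damaged = "date,category,amount\n2026-01-03,travel\n"  # one row lost, one field lost

print(check_csv(intact, columns, expected_rows=2))
print(check_csv(damaged, columns, expected_rows=2))
```

Checks like these catch dropped rows and truncated fields. They do not catch a value that was silently changed to another plausible value, so they complement review rather than replace it.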

Step 5: Treat file-access tools with caution. Since tool access worsened performance by 6 percent, consider whether file write access is necessary at each step. Some workflows can be redesigned so the AI generates output that a human or deterministic script writes to disk, rather than giving the agent direct write access.
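One possible shape for that redesign: the agent returns proposed file contents as plain data, and a small deterministic script decides what actually reaches the disk. The sketch below assumes the proposals arrive as a dictionary of relative paths to text. That structure and the `write_proposed_files` name are hypothetical, and the no-overwrite rule is a design choice you may want to relax.

```python
import os
import tempfile
from pathlib import Path

def write_proposed_files(proposals, output_root):
    """Write {relative_path: text} proposals under output_root, nothing else.

    `proposals` is a hypothetical structure for text an agent returned; the
    agent itself never touches the disk.
    """
    root = Path(output_root).resolve()
    written = []
    for relative_path, text in proposals.items():
        target = (root / relative_path).resolve()
        # Refuse absolute paths or "../" segments that escape the output folder.
        if root not in target.parents:
            raise ValueError(f"refusing to write outside {root}: {relative_path}")
        if target.exists():
            raise FileExistsError(f"refusing to overwrite {target}")
        target.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temp file first, then rename, so a crash leaves no partial file.
        fd, temp_name = tempfile.mkstemp(dir=target.parent)
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        os.replace(temp_name, target)
        written.append(target)
    return written
```

Because existing files are never overwritten, the worst an incorrect proposal can do is add a bad new file, which is far easier to notice and undo than a damaged original.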

Step 6: Break long workflows into shorter sub-tasks. The DELEGATE-52 study ran 20 interactions per workflow, and content loss accumulated over that span, often in sudden drops rather than a steady decline. For workflows that must run long, build in human handoffs rather than one continuous 20-step chain.
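A human handoff can be modeled as a pause between batches of steps. In this minimal sketch, `approve` is a hypothetical reviewer hook. In practice it might show a diff and wait for a person to confirm, while here it is just a function that returns True or False. The batch size is an assumption to tune for your own workflow.

```python
def run_in_batches(document, steps, batch_size, approve):
    """Run `steps` in groups, pausing for `approve` after each group.

    `approve(before, after, batch_number)` returns True to continue.
    Returns (last approved document, number of batches accepted).
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    current = document
    accepted = 0
    for start in range(0, len(steps), batch_size):
        candidate = current
        for step in steps[start:start + batch_size]:
            candidate = step(candidate)
        if not approve(current, candidate, accepted + 1):
            # Stop here and keep the last version a reviewer signed off on.
            return current, accepted
        current = candidate
        accepted += 1
    return current, accepted

# Stand-in steps that each append one character, reviewed two at a time.
append_a = lambda text: text + "a"
reject_second = lambda before, after, batch_number: batch_number != 2
print(run_in_batches("", [append_a] * 5, batch_size=2, approve=reject_second))
```

When a batch is rejected, the function returns the last approved version, so a bad batch costs at most `batch_size` steps of work.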

[Image: Guard rails and verification steps are essential for production agent workflows.]

What This Research Does Not Mean

This research does not say AI agents are useless. Python code manipulation passed. Shorter, contained tasks that do not involve multi-step document manipulation are outside the scope of this study. The researchers measured delegated workflows (tasks where you hand off and walk away), not supervised collaboration where a developer reviews each action.

Supervised use of Claude Code and similar tools, where a developer reviews each output, is a different risk profile from fully automated pipelines. The DELEGATE-52 warning applies most directly to automation that runs without human review at each step.

These models will improve. The benchmark exists to drive that improvement. But as of May 2026, with models currently in production, the data is clear: do not trust unmonitored multi-step file operations on content you cannot afford to corrupt.

Frequently Asked Questions

What is DELEGATE-52?

DELEGATE-52 is a benchmark created by Microsoft Research to evaluate how well AI models handle multi-step document manipulation tasks across 52 professional domains. It measures how much content AI agents preserve or lose through extended workflows of up to 20 interactions.

Which AI models were tested in DELEGATE-52?

The study tested Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4, GPT 5.2, GPT 5.1, and GPT 4.1. The list spans current frontier models and older versions, all available via API.

How severe is the content loss?

Frontier models lost an average of 25 percent of document content over 20 interactions. Across all tested models, the average loss was 50 percent. Only Python programming consistently met the 98 percent preservation threshold.

Does giving AI agents tool access help document preservation?

No. Giving models file read, write, and code execution tools degraded performance by an additional 6 percent on average compared to models without tool access.

Is this a problem specific to one AI provider?

No. The same degradation pattern appeared across models from Google, OpenAI, and Anthropic. This is a general LLM limitation in extended delegation tasks, not specific to any single provider.

Where is the full research paper?

The preprint is at arxiv.org/abs/2604.15597, titled "LLMs Corrupt Your Documents When You Delegate." It was published in April 2026 and reported by The Register on May 11.

Should I stop using AI agents for creative file tasks entirely?

For Python code manipulation, current models are adequate with proper oversight. For other document types, add human checkpoints at each major step, keep backups, and avoid fully unmonitored automation on irreplaceable files.