New Microsoft paper shows that current AI assistants often damage documents during long editing jobs.
Even frontier models corrupted about 25% of document content on average, and many other models damaged far more.
The problem is that delegating work to AI only makes sense if a model can keep a document correct across many edits, not just do one step well.
The paper tests this with reversible task pairs, where a model edits a file and then tries to undo that edit, so a reliable system should return to the original document.
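To make the round-trip idea concrete, here is a minimal sketch of such a check. It is an illustration only, not the paper's harness or metric: the function name `round_trip_damage`, the toy rename edits, and the use of `difflib` similarity as a damage score are all assumptions made for this example. A real setup would have a model produce the forward and backward edits.

```python
import difflib
from typing import Callable

Edit = Callable[[str], str]

def round_trip_damage(original: str, forward: Edit, backward: Edit) -> float:
    """Apply an edit and its supposed inverse, then score how much of the
    original was lost. 0.0 means the document was restored exactly.
    The difflib ratio is a stand-in score, not the paper's metric."""
    restored = backward(forward(original))
    matcher = difflib.SequenceMatcher(None, original, restored, autojunk=False)
    return 1.0 - matcher.ratio()

# Hypothetical reversible task pair: rename a variable, then rename it back.
doc = "total = 0\nfor x in items:\n    total += x\nprint(total)\n"

def rename(s: str) -> str:
    return s.replace("total", "running_sum")

def undo_rename(s: str) -> str:
    return s.replace("running_sum", "total")

def sloppy_undo(s: str) -> str:
    # Simulates a silent failure: the undo works but drops the last line.
    lines = undo_rename(s).splitlines(keepends=True)
    return "".join(lines[:-1])

faithful = round_trip_damage(doc, rename, undo_rename)
corrupted = round_trip_damage(doc, rename, sloppy_undo)
```

Here `faithful` is zero because the document comes back unchanged, while `corrupted` is positive because one line never returns. Chaining many such pairs is what lets occasional silent losses accumulate.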
The authors built real work setups across 52 domains, from coding and science to accounting and music notation, and ran 19 models through 20 editing interactions.
The failures were usually not lots of tiny slips but occasional big mistakes that silently broke parts of the document and then compounded over time.
Agentic tool use did not help in their tests, and bigger files, longer workflows, and irrelevant extra documents made the corruption worse.
This matters because current LLMs can look strong in short demos or narrow coding tasks yet still be unreliable delegates for long real-world document work.
----
Paper Link: arxiv.org/abs/2604.15597
Paper Title: "LLMs Corrupt Your Documents When You Delegate"