Silent Mutilation: Why Delegating File Edits to LLMs is a Systems Failure
I messed up badly last month. I thought I was being incredibly clever. I had inherited a massive monorepo littered with legacy any types, and instead of spending my weekend manually typing them out, I built an autonomous agent loop. The instructions were simple: read the TypeScript file and infer the types from usage before writing the file back to disk.
After kicking off the script and going to sleep, I woke up to 400 beautifully typed files. Then I ran the tests. They failed. The types were correct. However, the agent had silently stripped out dozens of subtle domain-specific comments. It flattened three highly optimized custom sorting algorithms into generic Array.prototype.sort() calls, while also erasing a bizarre but necessary workaround for a five-year-old Safari bug.
I had delegated my codebase to a statistical engine, and it rewarded me by reverting my hard-won domain knowledge back to the lowest common denominator of its training data. I created a monster because I misunderstood the tool I was holding.
The Anatomy of Semantic Ablation
We have a fundamental misunderstanding of what large language models actually are. When you hand an LLM a 2,000-line file and say, "add a boolean flag to the configuration object," you are not interacting with a hard drive. You are playing a high-stakes game of telephone.
Every time an entire document passes through the attention heads of a generative model, it undergoes a process I have come to call "semantic ablation." It loosely resembles what statisticians call regression to the mean. The model reads your highly specific, slightly eccentric code and emits the sequence of tokens it considers most probable. That Safari workaround? Statistically unlikely. That verbose but necessary comment explaining why you avoided a standard regex? An anomaly. So, the model quietly smooths them out. It rounds off the edges.
If you chain successive delegation tasks together in a loop, even state-of-the-art models can mangle a significant portion of your original document, because each pass rewrites the output of the previous one and the losses compound. The degradation is insidious because it is completely silent. The code still compiles with flawless syntax. The lint rules pass as well, but the soul of the architecture bleeds out with every iterative pass.
I have watched teams fall into the psychological trap of the "toy example." They run a prototype agent on a 50-line utility file. It works perfectly. They extrapolate that success to a 3,000-line core routing file, and suddenly they are hemorrhaging context.
We regularly see three distinct failure modes when piping whole files through an LLM:
- The Ghost Comment: The model does not understand the temporal value of // TODO: hacky flexbox fix, revert after Q3. It sees dead code or bad practice and simply drops it from the output.
- The Formatting Drift: Your file uses tabs. The model heavily favors spaces based on its pre-training. It subtly reformats the file, completely polluting your git history and making code review impossible.
- The Impatient Refactor: The model gets exhausted near the end of a long context window and outputs // ... rest of the code remains the same. The naive scripting harness literally writes that string to your production file, deleting half your application logic.
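If you are stuck with a whole-file pipeline for now, the three failure modes above can at least be detected before anything is written to disk. The following is a minimal sketch, not a complete safeguard: the names auditRewrite and RewriteIssue and the heuristics themselves (only // comments, a simple tabs-versus-spaces count, one regex for elision placeholders) are illustrative assumptions.

```typescript
// Hypothetical guard for whole-file rewrites. All names and heuristics here
// are assumptions made for illustration.
type Indent = "tabs" | "spaces";
type RewriteIssue =
  | { kind: "elision-placeholder"; text: string }
  | { kind: "dropped-comment"; text: string }
  | { kind: "indentation-drift"; from: Indent; to: Indent };

// Matches lines such as "// ... rest of the code remains the same".
const ELISION = /^\s*(\/\/|#|\/\*)\s*\.{3}.*\b(rest|remains?|unchanged|existing)\b/i;

// Collect the trimmed text of every line that is a "//" comment.
function commentLines(source: string): string[] {
  return source
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.startsWith("//"));
}

// Decide whether indented lines mostly start with a tab or with spaces.
function dominantIndent(source: string): Indent | "none" {
  let tabs = 0;
  let spaces = 0;
  for (const line of source.split("\n")) {
    if (line.startsWith("\t")) tabs++;
    else if (line.startsWith("  ")) spaces++;
  }
  if (tabs === 0 && spaces === 0) return "none";
  return tabs >= spaces ? "tabs" : "spaces";
}

function auditRewrite(original: string, rewritten: string): RewriteIssue[] {
  const issues: RewriteIssue[] = [];

  // The Impatient Refactor: a placeholder that was not in the original file.
  const originalLines = new Set(original.split("\n").map((l) => l.trim()));
  for (const line of rewritten.split("\n")) {
    if (ELISION.test(line) && !originalLines.has(line.trim())) {
      issues.push({ kind: "elision-placeholder", text: line.trim() });
    }
  }

  // The Ghost Comment: a comment line that no longer appears anywhere.
  const surviving = new Set(commentLines(rewritten));
  for (const comment of commentLines(original)) {
    if (!surviving.has(comment)) {
      issues.push({ kind: "dropped-comment", text: comment });
    }
  }

  // The Formatting Drift: the dominant indentation character changed.
  const before = dominantIndent(original);
  const after = dominantIndent(rewritten);
  if (before !== "none" && after !== "none" && before !== after) {
    issues.push({ kind: "indentation-drift", from: before, to: after });
  }

  return issues;
}
```

A harness could refuse to write the file whenever auditRewrite returns a non-empty list. A check like this only catches the symptoms it was written for; it does nothing about a sorting algorithm that was quietly swapped for a generic one.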
The failure lies in our systems architecture rather than the AI. We force a probabilistic engine to do deterministic work. When you pipe raw file strings directly into an LLM and pipe the output directly back to the file system, you wrap your critical assets in a layer of plausible bullshit.
Decoupling Intent from Execution
Systems thinking demands that we isolate non-deterministic components. You would never put a random number generator in the middle of your database transaction commit phase. So why do we put an LLM in charge of preserving string exactness across thousands of lines of code?
The fix requires a paradigm shift: decouple intent from execution.
The LLM should never output the full content of a file. Its job is solely to analyze the context and translate your natural language intent into deterministic edit instructions. The actual modification of the file system must be handled by rigid, traditional code that behaves predictably.
Instead of asking the model to rewrite the file, you ask it to generate a JSON payload containing specific search-and-replace blocks, or precise line-insertion commands. The model acts as a surgical planner. Your execution harness is the scalpel. If the model's search string does not match the source file exactly, the harness throws an error and aborts. No silent failures. No regression to the mean.
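To make the line-insertion variant concrete, here is a minimal sketch of what such a command and its deterministic executor might look like, using the "add a boolean flag to the configuration object" request from earlier. The InsertAfterLine shape, its field names, and applyInsertAfterLine are all hypothetical; the one idea that matters is that the model must quote the anchor line, and the harness refuses to act if the quote is wrong.

```typescript
// Hypothetical command shape: the model names a line and must quote it exactly.
interface InsertAfterLine {
  kind: "insert-after-line";
  lineNumber: number; // 1-indexed
  expectedLineText: string; // must equal the current text of that line
  newLines: string[];
}

// Pure function: takes file content and a command, returns new content or throws.
function applyInsertAfterLine(content: string, cmd: InsertAfterLine): string {
  const lines = content.split("\n");

  if (
    !Number.isInteger(cmd.lineNumber) ||
    cmd.lineNumber < 1 ||
    cmd.lineNumber > lines.length
  ) {
    throw new Error(
      `Line ${cmd.lineNumber} is out of range (file has ${lines.length} lines).`
    );
  }

  // The anchor check: abort if the model's picture of the file is wrong.
  const actual = lines[cmd.lineNumber - 1];
  if (actual !== cmd.expectedLineText) {
    throw new Error(
      `Anchor mismatch at line ${cmd.lineNumber}: expected ` +
        `${JSON.stringify(cmd.expectedLineText)}, found ${JSON.stringify(actual)}.`
    );
  }

  // Insert after the anchor; every other line is carried over untouched.
  lines.splice(cmd.lineNumber, 0, ...cmd.newLines);
  return lines.join("\n");
}
```

Because the function never regenerates the surrounding text, tabs stay tabs and comments stay where they were. The cost is that the model has to be exactly right about one line instead of approximately right about two thousand.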
Building a Surgical Edit Harness
Let us look at how this actually works in practice. We will build a strict edit harness in TypeScript. First, we define the boundary. We need a strict schema that forces the model to express its intent as discrete operations.
import { z } from "zod";
import { readFileSync, writeFileSync } from "fs";

// 1. Define the strict contract for the LLM
export const SurgicalEditSchema = z.object({
  thoughtProcess: z.string().describe("Briefly explain the intent behind the change"),
  operations: z.array(
    z.object({
      filePath: z.string(),
      exactSearchString: z.string().describe("Must match the target text EXACTLY, including whitespace"),
      replacementString: z.string().describe("The new text to insert"),
    })
  ),
});

type SurgicalEdit = z.infer<typeof SurgicalEditSchema>;

By forcing the model into this schema, we strip away its ability to silently modify unrelated parts of the file. If it wants to change a line, it has to declare exactly what it is replacing.
Next, we need an execution engine that aggressively validates these operations. This code is entirely deterministic.
class ExecutionHarness {
  public applySurgicalEdits(editPlan: SurgicalEdit): void {
    console.log(`Executing plan: ${editPlan.thoughtProcess}`);
    for (const op of editPlan.operations) {
      this.processOperation(op);
    }
  }

  private processOperation(op: SurgicalEdit["operations"][number]): void {
    const fileContent = readFileSync(op.filePath, "utf-8");

    // An empty search string "matches" between every character, so reject it up front
    if (op.exactSearchString.length === 0) {
      throw new Error(`Execution failed: Empty search string for ${op.filePath}.`);
    }

    // Count occurrences to ensure we don't accidentally replace the wrong block
    const occurrences = fileContent.split(op.exactSearchString).length - 1;
    if (occurrences === 0) {
      throw new Error(`Execution failed: Could not find exact match for search string in ${op.filePath}. The LLM hallucinated the target.`);
    }
    if (occurrences > 1) {
      throw new Error(`Execution failed: Search string in ${op.filePath} is not unique. Please prompt the LLM to provide more context lines.`);
    }

    // Pass the replacement as a function so that sequences such as "$&" or "$1"
    // in the model's text are inserted literally instead of being expanded
    const updatedContent = fileContent.replace(op.exactSearchString, () => op.replacementString);
    writeFileSync(op.filePath, updatedContent, "utf-8");
    console.log(`Successfully patched ${op.filePath}`);
  }
}

Notice the strictness. The processOperation method avoids guessing and fuzzy matching. If the exactSearchString provided by the model deviates by a single space from the target file, the operation fails loudly.
This is exactly what you want. It forces the LLM into a tight feedback loop where it must be precise, rather than letting it lazily rewrite the entire file and hallucinate away your comments. If the operation fails, you can catch the error and feed it back to the LLM, asking it to adjust its search string.
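The catch-and-retry loop described above might look something like the following sketch. It works on an in-memory string rather than the file system so that a failed plan leaves nothing half-applied. The Planner type is a hypothetical stand-in for whatever function calls your model; EditOp, applyOps, and editWithRetries are illustrative names, not part of the harness shown earlier.

```typescript
// One search-and-replace step, mirroring the fields used in the schema.
interface EditOp {
  exactSearchString: string;
  replacementString: string;
}

// Hypothetical: wraps the model call. It receives the source and, on a retry,
// the error message produced by the previous attempt.
type Planner = (source: string, feedback: string | null) => Promise<EditOp[]>;

// Apply every operation in memory. Throws on the first one that does not
// match exactly once, so a bad plan never produces a partial result.
function applyOps(source: string, ops: EditOp[]): string {
  let result = source;
  for (const op of ops) {
    if (op.exactSearchString.length === 0) {
      throw new Error("Search string is empty.");
    }
    const occurrences = result.split(op.exactSearchString).length - 1;
    if (occurrences !== 1) {
      throw new Error(
        `Expected exactly 1 match for ${JSON.stringify(op.exactSearchString)}, ` +
          `found ${occurrences}.`
      );
    }
    // Function replacer keeps "$&"-style sequences literal.
    result = result.replace(op.exactSearchString, () => op.replacementString);
  }
  return result;
}

// Ask for a plan, try it, and hand the error back to the planner on failure.
async function editWithRetries(
  source: string,
  plan: Planner,
  maxAttempts = 3
): Promise<string> {
  let feedback: string | null = null;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const ops = await plan(source, feedback);
    try {
      return applyOps(source, ops);
    } catch (err) {
      feedback = err instanceof Error ? err.message : String(err);
    }
  }
  throw new Error(`Gave up after ${maxAttempts} attempts. Last error: ${feedback}`);
}
```

The retry cap matters: a model that cannot find its anchor after a few attempts is telling you the task needs a human, and the harness should surface that rather than loop forever. Only after editWithRetries returns would the caller write the result to disk.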
The Next Level: AST Delegation
String replacement is vastly superior to whole-file generation, but it still breaks down on massive, complex refactors. For advanced use cases, I prefer delegating the scripting of the change rather than the string diff.
Instead of outputting text patches, you instruct the LLM to write a short Abstract Syntax Tree (AST) manipulation script using a tool like ts-morph.
import { Project } from "ts-morph";

// You prompt the LLM: "Write a ts-morph script to find all interfaces
// ending in 'Data' and append 'DTO' to their names."
// The LLM returns THIS execution script, not the modified source code:
function runLlmGeneratedMigration() {
  const project = new Project({ tsConfigFilePath: "./tsconfig.json" });
  const sourceFiles = project.getSourceFiles("src/**/*.ts");

  for (const sourceFile of sourceFiles) {
    const interfaces = sourceFile.getInterfaces();
    for (const iface of interfaces) {
      const currentName = iface.getName();
      if (currentName.endsWith("Data")) {
        // rename() also updates references to the interface elsewhere in the project
        iface.rename(`${currentName}DTO`);
      }
    }
  }

  // Changes are held in memory until this call writes them to disk
  project.saveSync();
}

By shifting the burden of modification from text prediction to AST manipulation, you take the model's text generation out of the path your source files travel, which removes most of the "telephone game" risk. The model writes the migration logic. The script runs deterministically against your codebase. The rename updates references across the project, and the parts of each file the script never touches, comments and whitespace included, are written back as they were.
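If the LLM writes a script that fails to type-check or throws at runtime, it fails before the save call, so nothing on disk has changed. If it writes a good script, it executes perfectly, leaving all unrelated formatting and structure completely untouched. The remaining risk is a script that runs cleanly but does the wrong thing, which is why the script itself still deserves a review and the run belongs on a clean git working tree where the resulting diff is easy to inspect.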
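The claim that unrelated lines stay untouched is cheap to verify rather than trust. One possible post-condition check, sketched below, compares a file before and after the migration and reports every changed line that does not mention the new name. The function name findUnexpectedChanges, the UnexpectedChange shape, and the assumption that a pure rename never adds or removes lines are mine, not something ts-morph provides.

```typescript
// Hypothetical post-migration check for a rename-only script.
interface UnexpectedChange {
  lineNumber: number; // 1-indexed
  before: string;
  after: string;
}

// `expected` should be a non-global RegExp describing what a legitimate
// change looks like, for example /\bUserDataDTO\b/.
function findUnexpectedChanges(
  before: string,
  after: string,
  expected: RegExp
): UnexpectedChange[] {
  const oldLines = before.split("\n");
  const newLines = after.split("\n");

  // Assumption: renaming identifiers keeps the line count stable.
  if (oldLines.length !== newLines.length) {
    throw new Error(
      `Line count changed from ${oldLines.length} to ${newLines.length}; ` +
        `a pure rename should not add or remove lines.`
    );
  }

  const unexpected: UnexpectedChange[] = [];
  for (let i = 0; i < oldLines.length; i++) {
    const oldLine = oldLines[i] ?? "";
    const newLine = newLines[i] ?? "";
    // A changed line is acceptable only if it contains the new name.
    if (oldLine !== newLine && !expected.test(newLine)) {
      unexpected.push({ lineNumber: i + 1, before: oldLine, after: newLine });
    }
  }
  return unexpected;
}
```

Run against the output of git show HEAD:path and the file on disk, an empty result means every changed line is one the migration was supposed to touch. A non-empty result is a reason to discard the run with git checkout and look at the script again.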
The Philosophy of Boundaries
We demand too much of our models, and far too little of our engineering harnesses. The rush to build autonomous agents has blinded many developers to basic software design principles. We threw out decades of established practices regarding immutability and pure functions, along with rigid boundaries, all because a chat interface looked vaguely like a terminal.
Automation should be about removing noise, not about removing humans from the loop. By narrowing the LLM's surface area to just generating localized diffs or deterministic commands, we isolate the probabilistic layer and protect the integrity of our systems.
If you are building workflows that rely on AI, treat the model as a highly intelligent but fundamentally unreliable translator. Confine it to a narrow, well-defined sandbox. Pass the results of that sandbox into rigid and deterministic execution engines.
Stop handing your entire codebase over to a probability matrix. Build smaller tools and define sharper boundaries. This keeps your hard-earned domain logic safely out of the ablation zone.