The Plan Is Not the Prompt

An agent ran for an hour and crashed at the end. The transcript is intact. Every tool call, every file it touched, every decision it narrated to itself is sitting right there in the log. And yet the work has lost its place. I can read what happened, but I cannot resume it, because nothing in that hour knows which step was next.

The obvious fix is to give it more context. Paste the plan back in, paste the transcript back in, tell it where it stopped, and let it continue.

That instinct comes from a belief I held without noticing: that the plan lives where the instructions live.

It doesn't. The instructions describe the plan. The model holds an impression of the plan. But the plan as a thing that survives a crash, that you can point at and ask "what is step 37," does not exist anywhere yet. You handed the agent the plan and the workspace in the same envelope, so when the workspace died, the plan died with it.¹

A bigger context window does not change this. It just gives the worker more wall to write on, and the wall goes blank the moment the worker leaves the room.

So the plan has to move. Out of the prompt, into code.

Not into smarter prompting, into ordinary deterministic code: a queue of work items, variables that hold state, a barrier that waits until five things finish before the sixth starts, a cap on how many agents run at once, and a log you can replay. None of that asks the model to remember anything. The queue remembers. The variable remembers. The replay log remembers. The model is handed one item, does it, and returns.
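To make that list concrete, here is a minimal sketch of the five pieces in one place, using Python's asyncio. Everything named here is an assumption for illustration: `fake_agent` stands in for a model call, `run_outside_plan` is a hypothetical name, and the replay log is an in-memory list where a real one would be a file or a database.

```python
import asyncio
import json
from collections import deque


async def fake_agent(item: str) -> str:
    """Hypothetical stand-in for a model call: handed one item, returns one result."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return item.upper()


async def run_outside_plan(items, max_concurrent=2):
    """Run six items: five behind a barrier, then the sixth. Assumes len(items) >= 6."""
    queue = deque(enumerate(items))          # the queue remembers what is left
    replay_log = []                          # the log remembers what finished
    cap = asyncio.Semaphore(max_concurrent)  # cap on how many agents run at once
    running = 0                              # plain variables hold the state
    peak = 0

    async def work(step, item):
        nonlocal running, peak
        async with cap:
            running += 1
            peak = max(peak, running)
            result = await fake_agent(item)
            running -= 1
        replay_log.append(json.dumps({"step": step, "item": item, "result": result}))
        return result

    # Barrier: gather() returns only after all five have finished.
    first_five = [queue.popleft() for _ in range(5)]
    await asyncio.gather(*(work(step, item) for step, item in first_five))

    # Only now does the sixth item start.
    step, item = queue.popleft()
    await work(step, item)
    return replay_log, peak
```

Calling `asyncio.run(run_outside_plan(list("abcdef")))` returns the log and the highest number of agents that ran at once. Nothing in the agent function knows about the queue, the cap, or the barrier.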

Call the deterministic part the outside plan: the queue, the variables, the barriers, the caps, the replay log, all of it living in code outside any single model's head. The outside plan is a ledger you cannot quietly edit. Each line is dated. Each result is fixed. You trust it to resume for the same reason you trust a lab notebook over your memory of the experiment.
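One plausible shape for that ledger is a file of JSON lines that the code only ever appends to. This is a sketch under assumptions: the field names (`step`, `result`, `at`) and the function names are mine, and append mode alone does not stop someone from editing the file by hand. It only means the program itself never rewrites a line.

```python
import json
from datetime import datetime, timezone


def append_entry(path, step, result):
    """Add one dated line to the ledger. Existing lines are never rewritten."""
    entry = {
        "step": step,
        "result": result,
        "at": datetime.now(timezone.utc).isoformat(),  # each line is dated
    }
    with open(path, "a", encoding="utf-8") as f:  # "a" = append only
        f.write(json.dumps(entry) + "\n")
    return entry


def read_ledger(path):
    """Read the ledger back in the order it was written."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```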

This is the casino move.

A casino is built on randomness it cannot predict and does not try to. It has no idea what the dice will do. But notice everything the casino refuses to leave to chance: the felt is the same size every night, the payout on a hard eight is fixed in advance, the dealer follows a script with no discretion, and the cameras log every hand. The entropy is pushed to one place, the roll, and walled off there. Everywhere else is deterministic on purpose. That is the whole design. Let the dice be wild, and make the table boring.

An agent is the dice. The outside plan is the table.

The model is the only part of the system you cannot predict, and that unpredictability is exactly what you are paying for, the same way the casino is selling the thrill of the roll. But you do not let the roll decide the size of the table. The moment you ask the model to also track which bets are open, hold the running total, and remember whose turn it is, you have moved the table inside the dice. Now the randomness you bought is corrupting the bookkeeping you needed to be exact.

Once the plan is outside, a strange thing happens to scale. A hundred-agent task stops being one heroic hour-long conversation and becomes a hundred short ones. The outside plan hands each agent a slice, the agent does the slice, and the agent returns a receipt: done, here is the output, here is what failed. No agent holds the whole plan, because no agent needs to. The plan is not in any of them. It is in the table they are all sitting at.
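The receipt can be as plain as it sounds. A sketch, assuming a hypothetical `Receipt` record and a hypothetical `deal` function that hands out slices one at a time; a real table would deal them concurrently, but the shape of what comes back is the same.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Receipt:
    """What an agent hands back: done or failed, plus the output or the error."""
    step: int
    status: str                   # "done" or "failed"
    output: Optional[str] = None
    error: Optional[str] = None


def run_slice(step: int, payload: str, agent: Callable[[str], str]) -> Receipt:
    """Give one slice to one agent and turn whatever happens into a receipt."""
    try:
        return Receipt(step, "done", output=agent(payload))
    except Exception as exc:
        return Receipt(step, "failed", error=str(exc))


def deal(slices: List[str], agent: Callable[[str], str]) -> List[Receipt]:
    """Hand each slice out in turn. The agent sees only its own slice."""
    return [run_slice(step, payload, agent)
            for step, payload in enumerate(slices, start=1)]
```

The receipt is frozen on purpose: once an agent has returned it, nothing downstream can revise it.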

Here is the part that surprised me. A memory problem grows with the work. A scheduling problem grows with the number of workers, which is something you set. When the plan lives in the model, every extra hour of work is extra weight on a thing that forgets. When the plan lives outside, the hard part stops growing with the task and starts growing with a number you chose.

This is why the crash stops mattering. If agent number forty dies mid-roll, the outside plan still holds the other thirty-nine receipts and knows forty never came back, so it re-deals forty. You replay from the exact point of failure because the point of failure is a row in a ledger, not a feeling in a room that has since emptied.
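Finding what to re-deal is then a set difference, not an act of recollection. A minimal sketch, assuming ledger rows are dictionaries with hypothetical `step` and `status` fields:

```python
def steps_to_redeal(total_steps, ledger):
    """Return every step with no 'done' row in the ledger, in order."""
    done = {row["step"] for row in ledger if row["status"] == "done"}
    return [step for step in range(1, total_steps + 1) if step not in done]


# Thirty-nine receipts came back; forty never did.
ledger = [{"step": s, "status": "done"} for s in range(1, 40)]
missing = steps_to_redeal(40, ledger)  # → [40]
```

A row marked failed counts as not done here, so a failed step is re-dealt the same way as one that never came back.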

But the table only stays exact if you keep it boring, and that is harder than it sounds.

The temptation, always, is to let the center get clever. To have the orchestrator peek at an agent's output and "just decide" what to do next, or summarize three results into one before passing them on, or quietly drop a failed item because retrying is annoying. Every one of those moves smuggles a die back onto the table. The center starts doing domain work, and domain work means judgment, and judgment means entropy, and now the part you needed to be replayable has a mind of its own.

So the rule is severe: the center does no domain work, hides no entropy, and spends intelligence only at the edges. The table routes, counts, waits, logs, and retries. The dice think. The center is dumb so that it can be trusted, and the edges are trusted to be smart because they cannot touch the books.²
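One way to hold yourself to that rule is to make the center's decisions depend only on a status and a counter, never on what the output says. The sketch below assumes a hypothetical `next_action` rule and a fixed limit of three attempts; the names and the limit are illustrative, not a standard.

```python
def next_action(status: str, attempt: int, max_attempts: int = 3) -> str:
    """Decide from status and attempt count alone. The output is never read."""
    if status == "done":
        return "record"
    if attempt < max_attempts:
        return "retry"
    return "record_failure"


def run_item(item, agent, max_attempts: int = 3):
    """Route one item to an agent, retrying on failure, logging every attempt."""
    log = []
    for attempt in range(1, max_attempts + 1):
        try:
            output, status = agent(item), "done"
        except Exception as exc:
            output, status = str(exc), "failed"
        action = next_action(status, attempt, max_attempts)
        # The output is stored, not interpreted: the center cannot judge it.
        log.append({"item": item, "attempt": attempt, "status": status,
                    "action": action, "output": output})
        if action != "retry":
            break
    return log
```

A failed item is never quietly dropped here. After the last attempt it is written down as a failure, which is a row someone can see.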

Here is the test, and you can run it before you write a single prompt.

Write the outside plan first, with fake agents. Every agent is a stub that returns canned output instantly and for free. Wire up the real queue, the real barriers, the real caps, the real replay log, and run the empty thing end to end. Then kill it halfway and resume it from step 37.

If the empty plan cannot replay from step 37 without spending a cent, your real one was never ready, and no amount of better prompting was going to save it. The plan was still in the room. You find the resume bug while it is free, or you find it an hour and a hundred dollars in.
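The test itself fits in a page. This is a sketch, not a harness: `stub_agent`, `SimulatedCrash`, and the fifty-step plan are all assumptions, and the crash is an exception raised just before step 37 rather than a killed process.

```python
import json
import os
import tempfile


class SimulatedCrash(Exception):
    """Stands in for the process dying partway through the run."""


def stub_agent(step):
    """A fake agent: canned output, instant, free."""
    return f"canned-{step}"


def load_done(path):
    """Read the ledger and return the set of steps already recorded."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return {json.loads(line)["step"] for line in f if line.strip()}


def run_plan(path, total, agent, crash_before=None):
    """Run every step not yet in the ledger. Returns the steps run this time."""
    done = load_done(path)
    executed = []
    for step in range(1, total + 1):
        if step in done:
            continue  # already in the ledger, so do not pay for it again
        if step == crash_before:
            raise SimulatedCrash(f"died before step {step}")
        output = agent(step)
        with open(path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"step": step, "output": output}) + "\n")
        executed.append(step)
    return executed


def demo():
    """Crash before step 37, then resume against the same ledger."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "ledger.jsonl")
        try:
            run_plan(path, 50, stub_agent, crash_before=37)
        except SimulatedCrash:
            pass
        resumed = run_plan(path, 50, stub_agent)
        return resumed[0], len(load_done(path))
```

Here `demo()` returns `(37, 50)`: the second run begins at step 37, and the ledger ends with all fifty steps, none of them run twice.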

Run the empty one until the crash is boring. Then, and only then, let the dice in.

Notes

¹ This is the durable-execution model, borrowed from systems that predate agents: split a long-running program into a deterministic conductor and nondeterministic activities whose results are written to an append-only log, so a crash replays the log instead of re-running the side effects. The honest narrower claim is that for short tasks the model's impression of the plan is fine. The failure is specific to long, multi-step, resumable runs, where that impression has to survive a process that ends. See Temporal on durable execution, and the deterministic record-and-replay work it descends from (King, Dunlap, and Chen, "Debugging Operating Systems with Time-Traveling Virtual Machines," USENIX 2005).
² The labs converged on this independently: Anthropic's orchestrator-workers pattern, where the orchestrator decomposes and delegates but does not do the domain work, and the same parent-child delegation in OpenAI's Agents SDK and Google's ADK. The narrower claim is that the center may do trivial, deterministic shaping (routing, counting, formatting) without losing replayability. What it must not do is the judgment a replay could not reproduce. The sharpest statement of the rule is Mikhail Rogov's "Why Your AI Orchestrator Should Never Write Code" (2026): the orchestrator must never execute.