What I Learned After Burning 3.3B Tokens on a Long Refactor

Jun 04, 2026

I recently let Codex run on a long-horizon refactoring task in a single session that spanned almost four days. The branch was real, the work was useful, and the session eventually became one of the best debugging artifacts I have seen for understanding the economics of agentic coding.

I put the small analyzer script I used for this here: Codex session JSONL analyzer.

The scary number was this:

    total input tokens: 3,304,313,992
    cached input tokens: 3,261,129,088
    output tokens: 4,261,261
    reasoning output tokens: 1,122,326

At first glance, that looks absurd. 3.3 billion tokens sounds like a runaway agent. But the useful lesson was more specific: this was not 3.3 billion fresh tokens. It was mostly repeated cached context.

After parsing the session logs more carefully, the final estimate looked like this:

    summed input tokens: 3,306,418,757
    cached input tokens: 3,263,211,776
    uncached input tokens: 43,206,981
    output tokens: 4,263,613
    cache hit rate: 98.7%
    estimated cost: $1,975.55
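The derived lines in that estimate follow directly from the two raw totals. A minimal sketch of the arithmetic, using only the numbers quoted above (the variable names are mine, not the analyzer's):

```python
# Totals copied from the final estimate above.
summed_input = 3_306_418_757
cached_input = 3_263_211_776

# Cached input is counted inside input, so uncached is the difference.
uncached_input = summed_input - cached_input
cache_hit_rate = cached_input / summed_input

print(f"uncached input tokens: {uncached_input:,}")
print(f"cache hit rate: {cache_hit_rate:.1%}")
```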

The model timeline was simple:

    model: gpt-5.5
    effort: xhigh
    span: 2026-05-30T16:13:44Z -> 2026-06-03T09:02:19Z
    events: 22,064 token accounting events

The model was not the mystery. The expensive part was simpler: each call kept bringing a big chunk of the old session along with it.

Cached Is Cheaper, Not Free

The key ratio is not cached input divided by total input. Cached input is usually a subset of input, so that ratio is just the cache hit rate, and it says little about where the spend goes. The useful ratio is cached input divided by uncached input.

In this run:

cached / uncached ~= 75.5x

That matters because cached tokens are cheaper than fresh input tokens, but they are still billable. If cached input is 10x cheaper than fresh input, then cached input spend equals fresh input spend when:

cached input ~= 10 * uncached input

This session was far past that point. The fresh input was still important, but the dominant input cost was repeated cached-context replay.
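To see how far past, here is a small worked example. It keeps the hypothetical 10x discount from the paragraph above (an assumption, not a quoted price) and measures cached spend in "fresh-token equivalents", so no real price list is needed:

```python
cached_input = 3_263_211_776
uncached_input = 43_206_981

# Assumption carried over from the text: a cached token costs 1/10 of a fresh one.
CACHED_DISCOUNT = 10

token_ratio = cached_input / uncached_input

# Cached spend expressed as the number of fresh tokens that would cost the same.
cached_spend_equiv = cached_input / CACHED_DISCOUNT
spend_ratio = cached_spend_equiv / uncached_input

print(f"cached / uncached tokens: {token_ratio:.1f}x")
print(f"cached spend / fresh spend: {spend_ratio:.2f}x")
```

Under that assumed discount, cached replay costs roughly seven and a half times as much as all the fresh input combined. A different discount changes the multiple but not the shape of the argument.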

That changed how I think about long-running agent sessions. A high cache hit rate is not automatically good or bad. It means the agent is reusing a lot of context. That is great if the context is compact, relevant, and preventing rediscovery. It is expensive if the session is carrying old decisions, noisy tool outputs, stale plans, and broad codebase history that no longer affects the next task.

The Per-Request Distribution Was the Real Speedometer

The cumulative billions were attention-grabbing, but the per-request numbers were more actionable:

    input p50: 154,581
    input p90: 227,760
    input p99: 243,198
    input max: 245,345
    events above 272K input: 0
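A summary like this can be produced from a plain list of per-request input token counts. The sketch below uses a nearest-rank percentile, which is one common definition; I am not claiming it is the exact method the analyzer uses, and the function names are hypothetical:

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding float rounding surprises.
    rank = max(1, -(-(pct * len(ordered)) // 100))
    return ordered[rank - 1]

def summarize_inputs(input_tokens, limit=272_000):
    """Summarize per-request input sizes; limit mirrors the 272K line above."""
    return {
        "p50": percentile(input_tokens, 50),
        "p90": percentile(input_tokens, 90),
        "p99": percentile(input_tokens, 99),
        "max": max(input_tokens),
        "above_limit": sum(1 for n in input_tokens if n > limit),
    }
```

Feeding it one number per model call gives the per-request view directly, which is easier to act on than the cumulative total.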

In plain English: a normal model call in the later session was carrying roughly 150K to 245K input tokens.

That is the number that tells me whether restarting a session might help. A new session is only cheaper if it shrinks the future working set. If I restart and give the next agent a crisp 5K to 20K handoff, future turns are much cheaper. If I restart and force the next agent to rediscover the whole branch, I just pay fresh input again and lose time.

So the question is not “cached input is high, should I restart?” The question is:

Can I replace this giant accumulated context with a small, accurate handoff?

If yes, restart. If no, keep going or first write the handoff.
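That decision can be roughed out with a deliberately simple model. Everything here is an assumption for illustration: context size stays flat after the restart, the handoff plus any rediscovered material is paid once as fresh input and then replayed as cached input, and cached tokens cost 1/10 of fresh ones. Output tokens are ignored.

```python
def restart_saves_tokens(current_context, handoff, rediscovery,
                         future_calls, cached_discount=10):
    """Compare input cost, in fresh-token equivalents, of staying vs restarting."""
    # Staying: every future call replays the current context from cache.
    stay = future_calls * current_context / cached_discount

    # Restarting: pay for the new context once as fresh input,
    # then replay it from cache on every future call.
    new_context = handoff + rediscovery
    restart = new_context + future_calls * new_context / cached_discount
    return restart < stay

# A crisp 20K handoff against the p50 working set from this run.
print(restart_saves_tokens(154_581, handoff=20_000, rediscovery=0, future_calls=100))

# A thin handoff that forces 200K tokens of rediscovery.
print(restart_saves_tokens(154_581, handoff=5_000, rediscovery=200_000, future_calls=100))
```

The first case favors restarting and the second does not, which matches the intuition: the handoff only pays off if it actually shrinks the working set.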

Compaction Helped, But It Was Not a Reset

There were 87 compactions in the session. After compaction, the first nonzero input event usually dropped to about 26K tokens:

    first nonzero input after compaction p50: 26,131
    input before compaction p50: 244,119

That sounds great, and it is. But the context ramped back up quickly:

    after 20 calls: p50 input ~= 61,794
    after 50 calls: p50 input ~= 93,684
    after 100 calls: p50 input ~= 139,180
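One way to measure a ramp like this is to cut the event stream at each compaction and take the median input size at a fixed number of calls afterwards. The event shape below (a "type" key and an "input_tokens" key) is hypothetical and chosen for the sketch; real session logs will need their own field mapping.

```python
from statistics import median

def ramp_after_compaction(events, offsets=(1, 20, 50, 100)):
    """Median input size at the k-th nonzero call after each compaction."""
    segments, current = [], None
    for ev in events:
        if ev["type"] == "compaction":
            # Start a new segment; calls before the first compaction are skipped.
            current = []
            segments.append(current)
        elif current is not None and ev.get("input_tokens", 0) > 0:
            current.append(ev["input_tokens"])

    result = {}
    for k in offsets:
        # Only segments that lasted at least k calls contribute a sample.
        samples = [seg[k - 1] for seg in segments if len(seg) >= k]
        if samples:
            result[k] = median(samples)
    return result
```

Short segments drop out of the later offsets, so the sample count behind each median shrinks as the offset grows.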

This made compaction look less like a reset and more like a pressure valve. It kept the session survivable, but it did not change the working style that kept rebuilding a large context.

Tool Churn Was the Other Signal

The session had 25,383 tool calls:

    exec_command: 15,998
    write_stdin: 8,441
    update_plan: 894

The most repeated commands were familiar:

    cargo fmt --all -- --check
    git status --short
    cargo check --workspace --all-targets
    rg searches
    sed reads

None of these are wrong. In fact, they are often the right behavior for a coding agent: inspect, edit, test, verify, commit. The problem is multiplication. A small command repeated hundreds or thousands of times becomes part of the cost profile, especially when tool outputs are fed back into the conversation.
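Spotting that multiplication does not need anything elaborate. A minimal sketch, assuming the executed command strings have already been extracted from the session log into a list (the helper names are hypothetical):

```python
from collections import Counter

def command_key(command, words=2):
    """Collapse a shell command to its first few words, e.g. 'git status'."""
    return " ".join(command.split()[:words])

def top_commands(commands, n=5):
    """Return the n most frequent command prefixes with their counts."""
    return Counter(command_key(c) for c in commands).most_common(n)

commands = [
    "git status --short",
    "cargo check --workspace --all-targets",
    "git status --short",
    "git status",
]
print(top_commands(commands))
```

Grouping by the first two words is a crude normalization, but it is usually enough to show which loops dominate.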

I still want the agent to use tools. I just want fewer tiny loops and more phase-sized checks:

run broader searches once, then summarize the result
avoid repeated status checks that do not change decisions
batch related edits before running expensive verification
keep long-running command output bounded
ask the agent to preserve a concise work log instead of relying on raw history
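Bounding command output is the easiest of these to mechanize. The sketch below is a hypothetical wrapper, not a Codex feature: it keeps the first and last few lines of an output and replaces the middle with a one-line marker.

```python
def bound_output(text, head=20, tail=20):
    """Keep the first `head` and last `tail` lines, marking what was dropped."""
    lines = text.splitlines()
    if len(lines) <= head + tail:
        return text
    omitted = len(lines) - head - tail
    marker = f"[... {omitted} lines omitted ...]"
    # Slice from an explicit index so tail=0 keeps no trailing lines.
    kept = lines[:head] + [marker] + lines[len(lines) - tail:]
    return "\n".join(kept)
```

Keeping the tail matters for build and test output, where the summary and the first error usually sit at opposite ends.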

Long-Horizon Agents Need a Different Loop

I do not think the answer is to tell agents to stop using tools. Tool use is how they stay grounded.

But long-horizon agent work probably needs a different development loop than human-local coding. The usual cycle of inspect, edit, test, status, repeat works well when the context is in a human’s head and the terminal output disappears. In an agent session, that loop leaves residue. Searches, diffs, test output, plans, and status checks all become part of the working history.

That makes token waste partly an SDLC problem. For large refactors, I want phase-level checkpoints, bounded tool output, concise handoff files, and batched verification. A single extra status check is noise. The same habit repeated through a multi-day session becomes real budget.

What I Would Do Differently

For the next long refactor, I would still use an agent. I would still let it run for hours. But I would structure the work differently.

First, I would split the task into phases with explicit handoff files. Each phase gets a short goal, relevant files, decisions made, tests run, and unresolved risks. The handoff should be good enough that a new session can continue without replaying the whole conversation.
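As a sketch of what such a handoff could look like in practice, the structure below holds the five parts listed above and renders them as a short plain-text file. The class and field names are illustrative, not an established format.

```python
from dataclasses import dataclass, field

@dataclass
class Handoff:
    """One phase's handoff: goal, files, decisions, tests, and open risks."""
    goal: str
    relevant_files: list[str] = field(default_factory=list)
    decisions: list[str] = field(default_factory=list)
    tests_run: list[str] = field(default_factory=list)
    unresolved_risks: list[str] = field(default_factory=list)

    def render(self) -> str:
        sections = [
            ("Goal", [self.goal]),
            ("Relevant files", self.relevant_files),
            ("Decisions made", self.decisions),
            ("Tests run", self.tests_run),
            ("Unresolved risks", self.unresolved_risks),
        ]
        out = []
        for title, items in sections:
            out.append(f"{title}:")
            # Empty sections are written explicitly so gaps are visible.
            out.extend(f"- {item}" for item in items or ["(none)"])
            out.append("")
        return "\n".join(out).rstrip() + "\n"
```

Writing empty sections as "(none)" forces the outgoing session to state that nothing is pending rather than silently omitting it.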

Second, I would treat 150K+ per-request input as a warning light. It does not mean stop immediately, but it means the session deserves a checkpoint. At that point I want to know: what context is still useful, what can be summarized, and what can be dropped?
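A warning light of that kind could be as simple as a rolling median over recent calls. This is a sketch under my own assumptions: the 20-call window is arbitrary, and the 150K threshold is the one named above.

```python
from statistics import median

def needs_checkpoint(input_tokens, threshold=150_000, window=20):
    """True when the median of the last `window` calls reaches the threshold."""
    recent = input_tokens[-window:]
    # Require a full window so a few large early calls do not trigger it.
    return len(recent) == window and median(recent) >= threshold
```

Using a median instead of the latest value keeps a single oversized call from tripping the warning.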

Third, I would monitor tool churn. If the agent is repeatedly running the same status, search, or verification loop, I want it to batch decisions instead of iterating one microscopic slice at a time.

The Real Lesson

The expensive part of agentic coding is not just “the model wrote a lot of tokens.” It is the loop:

large retained context * many model calls * noisy tool feedback

Prompt caching changes the economics, but it does not remove them. Long-horizon agents need context budgets just like production systems need CPU, memory, and I/O budgets.

My new rule of thumb is simple:

Do not restart because the cached-token number is big. Restart when you can replace accumulated history with a smaller, better handoff.

That is the difference between throwing away useful state and actually improving the economics of the next turn.

Art’s Substack
