The Claude Code cache bug that's draining your tokens
If your Claude Code Max plan quota has been evaporating faster than usual since late March 2026, you're not imagining it. There's a confirmed bug that silently rebuilds your entire conversation cache on every API call, charging you full cache creation rates instead of reading from cache.
I've been hit by this myself. One session burned through 70% of a Max 5x quota in under 5 hours. Something that should have cost a few cents was costing dollars.
Here's what's actually happening, who found it, and what you can do right now.
What the bug does
Claude Code uses prompt caching to avoid re-sending your full conversation on every turn. Normally, the first message in a session writes the cache (expensive), and every subsequent message reads from it (cheap). Cache reads cost roughly a tenth of what cache writes cost.
The bug breaks this. After certain triggers, the cache stops being read entirely. Every single turn rebuilds the full conversation from scratch. If you have a 300K token conversation, that's 300K tokens of cache creation on every message instead of cache reads.
A developer named jmarianski tracked this down with meticulous token logging. Here's what normal looks like vs broken:
Normal (healthy cache):
cache_read: 318,308 cache_creation: 245 ← reading from cache, tiny writes
cache_read: 319,054 cache_creation: 108 ← cache grows, creation stays small
cache_read: 320,707 cache_creation: 393 ← good
Broken (cache invalidated):
cache_read: 11,428 cache_creation: 224,502 ← only system prompt cached, full rebuild
cache_read: 11,428 cache_creation: 224,953 ← same thing, every single turn
cache_read: 11,428 cache_creation: 228,249 ← still broken, bleeding tokens
That 11,428 is just the system prompt. The rest of your conversation gets rebuilt from scratch every time. At cache-write rates ($3.75 per million tokens on Sonnet, 1.25x the normal input price), a 230K rebuild costs about $0.86 per message. In a session with 50 tool calls, that's roughly $43 gone for what cache reads would have covered for around $3.50.
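To put numbers on it, here's a back-of-the-envelope comparison using the log figures above. The rates are Claude Sonnet list prices per million tokens; that model choice is an assumption, and a different model changes the dollar amounts but not the ratio.

```python
# Cost per turn under assumed Claude Sonnet pricing ($/MTok).
INPUT = 3.00        # normal input
CACHE_WRITE = 3.75  # 5-minute cache write (1.25x input)
CACHE_READ = 0.30   # cache read (0.1x input)

def turn_cost(read_tokens, write_tokens):
    """Dollar cost of one API turn given cache read/write token counts."""
    return (read_tokens * CACHE_READ + write_tokens * CACHE_WRITE) / 1e6

healthy = turn_cost(320_707, 393)      # figures from the healthy log above
broken = turn_cost(11_428, 228_249)    # figures from the broken log above
print(f"healthy turn: ${healthy:.3f}")     # ~ $0.098
print(f"broken turn:  ${broken:.3f}")      # ~ $0.859
print(f"50 broken turns: ${50 * broken:.2f}")
```

The per-turn gap looks small in isolation; it's the fact that every subsequent turn pays the write price that makes long sessions hemorrhage quota.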
The root cause (reverse engineered)
jmarianski went deep on this. Like, Ghidra reverse engineering the compiled binary deep. What they found is wild.
There's a sentinel string cch=00000 in the billing attribution header that gets replaced with a hash of the request body on every API call. This happens in Anthropic's custom Bun fork, at the native Zig layer, invisible to the JavaScript code.
The problem: this replacement uses memmem to find the first occurrence of cch=00000 in the request body. If that string appears anywhere in your conversation messages (which come before the system prompt in the serialized JSON), the wrong occurrence gets replaced: the billing header keeps its 00000, the conversation gets a hash injected into it, and the cached prefix changes.
Once the cache prefix changes, the entire cache is invalidated. And since the hash changes with every request body, it never recovers. Every subsequent turn does a full cache rebuild.
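To make the failure mode concrete, here's an illustrative Python sketch of the reported behavior. The real replacement happens in Zig via memmem, and the actual hash format is unknown; the JSON shape and the stamp_billing_header name here are mine, chosen only to show why first-occurrence replacement hits the wrong string.

```python
import hashlib
import json

SENTINEL = "cch=00000"

def stamp_billing_header(body: str) -> str:
    # Mimics the reported native-layer step: replace the FIRST occurrence
    # of the sentinel in the raw request body with a request-body hash.
    digest = hashlib.sha256(body.encode()).hexdigest()[:5]
    return body.replace(SENTINEL, f"cch={digest}", 1)

# Conversation messages serialize before the billing field in the body,
# so a sentinel inside a message is found first.
body = json.dumps({
    "messages": [{"role": "user", "content": "what does cch=00000 do?"}],
    "metadata": {"billing": SENTINEL},
})
stamped = stamp_billing_header(body)
# Result: the user message got a hash injected, the billing field kept
# its zeros, and the cached conversation prefix now changes every request.
```

Because the hash is derived from the whole body, it differs on every call, which is why the cache never re-stabilizes once the sentinel leaks into a message.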
This can happen when:
- Your CLAUDE.md discusses billing or the cch mechanism
- A Read or Grep tool reads a file containing the sentinel
- You type the string literally in a message
- The tool search feature loads certain content
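If you want to check whether your own project contains the trigger string before starting a long session, a quick scan along these lines works. The find_sentinel helper and the file-suffix filter are my own choices, not part of any official tooling.

```python
import pathlib

# Assemble the sentinel at runtime so this scanner's own source can't
# retrigger the bug if an agent reads the file.
SENTINEL = "cch=" + "0" * 5

def find_sentinel(root=".", suffixes=(".md", ".txt", ".json")):
    """Return project files that contain the cache-breaking trigger string."""
    hits = []
    for path in pathlib.Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            try:
                text = path.read_text(errors="ignore")
            except OSError:
                continue
            if SENTINEL in text:
                hits.append(path)
    return hits
```

Run find_sentinel(".") from your project root; any hit in CLAUDE.md or other files the agent routinely reads is a candidate trigger.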
Who's affected
Everyone using the standalone Claude Code binary (the 228MB ELF). The npm package (npx @anthropic-ai/claude-code) is NOT affected because the sentinel replacement lives in the native layer, not the JavaScript.
From the GitHub thread, multiple users confirmed the issue:
- Max 5x users hitting limits in 1 hour instead of days
- Max 20x users seeing 70% quota consumed in a single session
- One user reported 200-300K token spikes per turn when the cache drops
- The bug persists across versions 2.1.86, 2.1.87, and 2.1.88
What to do right now
1. Switch to the npm package (immediate fix)
npx @anthropic-ai/claude-code
This runs the JavaScript version, which skips the native sentinel replacement entirely, so the cache bug can't trigger. Functionality is otherwise the same.
2. Keep sessions short
Don't let conversations grow to 300K tokens. Start fresh sessions for new tasks. The cost of cache creation on a short conversation is negligible. On a long one, it's brutal.
3. Compact proactively
Use /compact at natural breakpoints (after finishing a task) rather than waiting for auto-compact to trigger at 100% context. This reduces the conversation size that needs to be rebuilt if cache does drop.
4. Disable tool search (experimental)
Some users report that setting ENABLE_TOOL_SEARCH=false in your environment helps. Add it to your settings:
{
  "env": {
    "ENABLE_TOOL_SEARCH": "false"
  }
}
This is unconfirmed as a fix but several people reported improvement.
5. Don't edit settings mid-session
Changes to .claude/settings.json or CLAUDE.md during a session can trigger system prompt reloads, which can cascade into cache invalidation.
6. Use skill files to keep sessions focused
This is where I'm biased, but it's genuinely relevant. A skill file gives your agent specific instructions upfront, which means it needs fewer back-and-forth turns to get to the right output. Fewer turns = less cache to rebuild when things go wrong.
If your agent is spending 50 tool calls figuring out the architecture because you didn't give it a skill file, that's 50 opportunities for the cache to break. Give it the patterns and it runs in 5 calls instead.
Anthropic's response
An Anthropic employee acknowledged the issue on X, confirming it's being investigated. No timeline for a fix yet.
The community is also asking for quota resets for affected users, which seems fair. If you're paying for Max and the tool is burning through your allocation due to a bug in their binary, that's not on you.
The bigger picture
This bug is a good reminder of something I keep saying: the tools are powerful, but they're not magic. You need to understand what's happening under the hood enough to know when something is wrong.
Most people hit their quota limit and assumed they were just "using Claude too much." They weren't. A hidden binary-level mechanism was silently multiplying their costs by 10x.
If your token usage looks weird, check jmarianski's diagnostic tool or run the test script from the GitHub issue. Know your numbers.
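If you'd rather eyeball it yourself, a minimal sketch of the same check is below, assuming you can pull per-turn cache_read_input_tokens and cache_creation_input_tokens values (the field names the Messages API reports in its usage object). The threshold is an arbitrary heuristic of mine.

```python
def cache_looks_broken(usage_log, threshold=0.5):
    """Flag turns where cache creation dwarfs cache reads.

    usage_log: list of dicts with cache_read_input_tokens and
    cache_creation_input_tokens, one entry per API turn.
    Returns the indices of suspicious turns.
    """
    flagged = []
    for i, u in enumerate(usage_log):
        read = u.get("cache_read_input_tokens", 0)
        created = u.get("cache_creation_input_tokens", 0)
        total = read + created
        # Healthy turns read far more than they create; broken turns invert that.
        if total > 0 and created / total > threshold:
            flagged.append(i)
    return flagged

healthy = [{"cache_read_input_tokens": 318_308, "cache_creation_input_tokens": 245}]
broken = [{"cache_read_input_tokens": 11_428, "cache_creation_input_tokens": 224_502}]
print(cache_looks_broken(healthy))  # []
print(cache_looks_broken(broken))   # [0]
```

One flagged turn after a /compact is normal (the cache legitimately rebuilds); a long run of flagged turns is the bug.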
Sources: