Skip to content
OBLAIDISH NEWS
Self-hosted Claude Code speedup: caching fix eliminates 15× slowdown
TX_797693AI

Self-hosted Claude Code speedup: caching fix eliminates 15× slowdown

Self-hosted Claude Code ran 15× slower because a rotating billing header broke caching in vllm‑mlx’s SimpleEngine; a shim and upstream patch restore caching and cut latency to 7‑8 seconds.

sources[devto]

Vinay observed that self‑hosted Claude Code was 15 times slower than expected, tracing the slowdown to two issues: a rotating 81‑byte x-anthropic-billing-header and the absence of caching in vllm‑mlx’s SimpleEngine [devto].

The billing header changed on every request, invalidating any cache key. Vinay solved this by inserting a shim that strips the header from the payload before it reaches vllm‑mlx, ensuring a stable cache identifier.

He also patched SimpleEngine to add a single‑slot, hash‑keyed system‑prefix KV cache. The patch detects the system prefix using ChatML markers, restores the saved KV snapshot, and only prefills the suffix on a cache hit. The change landed upstream as PR #523 [devto].

With both the shim and the prefix‑cache patch in place, warm turns now complete in 7–8 seconds—a 13–15× speedup over the original latency.

Why it matters: the fix demonstrates that even tiny protocol details—an 81‑byte header—can cripple performance when caching is involved. Engineers building LLM inference stacks can adopt the shim and SimpleEngine patch to reclaim latency gains without redesigning their entire stack. The upstream PR also makes the improvement available to any vllm‑mlx deployment, lowering the barrier for high‑throughput, self‑hosted Claude Code usage.

operator_channel
[ comments_offline · provider_not_configured ]
transmission_log

Subscribe to the broadcast.

Daily digest of the day's most important tech news. No fluff. Engineering signal only.

// delivered via substack · double-opt-in confirmation