The multi-step turn savings are what make this really add up, though. A single user message triggering 5-6 tool calls means 5-6 API calls, each re-sending the whole conversation so far, and everything before the latest tool result is a cache hit. That's where you actually get close to the 10x number.
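A rough back-of-the-envelope sketch of that dynamic, assuming cache reads are billed at ~0.1x base input and cache writes at ~1.25x (the multipliers Anthropic publishes; the token counts here are made up for illustration):

```python
# Rough cost model for one user turn that triggers n_steps tool calls.
# Each step re-sends the growing context; with caching, everything
# already seen is a cheap cache read and only the new tokens are written.

def turn_cost(prefix_tokens, step_tokens, n_steps, cached):
    """Input-token cost for one multi-step turn, in base-token units."""
    cost = 0.0
    context = prefix_tokens
    for _ in range(n_steps):
        if cached:
            cost += context * 0.1        # cache read of everything so far
            cost += step_tokens * 1.25   # cache write of the new tool result
        else:
            cost += context + step_tokens  # full price every step
        context += step_tokens
    return cost

# 50k-token prefix (system prompt + history), 1k tokens per tool step, 6 steps.
uncached = turn_cost(50_000, 1_000, 6, cached=False)
cached = turn_cost(50_000, 1_000, 6, cached=True)
print(round(uncached / cached, 1))  # → 8.2
```

The bigger the shared prefix relative to each step's new tokens, the closer the ratio creeps toward the 10x cache-read discount.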
One thing I'd add: this pairs well with routing simpler turns to cheaper models entirely. Caching saves you on input tokens, but if the turn is straightforward enough that Sonnet or gpt-4.1-mini can handle it, you save on both input and output. The two approaches are complementary.