Every Token in Your Context Is a Cost
Cache discipline is the new prompt engineering. Here's how I cut my multi-agent stack's token use by roughly 60% in a week, and why your stack is probably leaking too.
The month my AI bill crossed my AWS bill, I didn’t panic. I assumed I’d been running agents harder than usual and waited for it to normalize.
It didn’t normalize.
What I had instead was a slow, formless suspicion, the kind you can’t act on because you can’t name it. Something in my stack was wasteful. The invoices confirmed it. But I couldn’t point at a single decision and say: that’s the problem. Every component looked reasonable in isolation. The system as a whole was expensive in a way I couldn’t see.
That’s the invisible cost trap. It doesn’t announce itself. It accrues.
This essay is what I found when I finally looked: a full audit of a 4-agent stack, measured numbers from before and after, a framework I’m calling cache discipline, and seven principles I’m grouping under the Token Telescope. I cut context costs by roughly 60% in a week without changing what the agents could do. The leaks were everywhere. Yours probably are too.
The Audit That Changed Everything
I run a 4-agent stack on a popular multi-agent orchestration platform. Four agents — a CEO, CTO, CMO, and Sr Engineer — coordinating across continuous work queues driven by crons and webhooks.
The audit started with a simple question: what is in context on every agent run, and does it need to be there?
The first thing I found: 615KB of plugin data loaded as system reminders every session. One integration alone, the workflow automation plugin, contributed the majority. It was installed. It wasn’t used. Every session started with half a megabyte of context that did nothing but sit there and cost money.
The second thing: personality prompts. The lead agent carried a SOUL.md and a HEARTBEAT.md: documents defining its communication style and values. Together, 5.6KB. Reloaded on every agent run. Not once. Not daily. Every. Single. Run. The CTO and Sr Engineer carried similar payloads: 1.8KB and 1.4KB respectively. Total per cycle: 8.8KB of personality, of which the 7.6KB I later cut made no measurable difference to output quality.
The third: duplicate documentation. Reference files that existed in both the agent instructions and as attachments. The agent read the same content twice on every run because no one had cleaned up the overlap.
The fourth: cron output inline. Status checks ran three times an hour and dumped their full output directly into the next agent prompt. Every hour, the ops context grew. The stable, cacheable prefix kept getting polluted by fresh, variable output.
Simon Willison’s Agentic Engineering Patterns names several of these as engineering patterns: the 92% cache hit rate achievable in well-structured Claude Code sessions, the discipline of never mutating tools mid-session. He’s right about the patterns. What I hadn’t seen named anywhere was the practice: the deliberate, ongoing work of treating your context as a budget and defending it.
There’s a word for that. Cache discipline.
The Cost of Reasonable Choices
Here’s what’s strange about this kind of waste: every single leak was a defensible decision at the time.
The personality prompts? Adding them felt like good engineering. Consistent agent behavior. Branded tone. Real value, or so I assumed. Installing the workflow plugin made sense when I was evaluating integrations. Never removing it was an oversight, but a minor one. The duplicate docs were a copy-paste accident that never got cleaned up. The inline cron output was the simplest way to pass state between runs, and it worked.
None of these were mistakes in isolation. They were the output of reasonable decisions made without a cost lens.
What I didn’t have — what I suspect most agent operators don’t have — is the habit of asking, every time I add something to context: does this need to be here right now? The question sounds obvious. Ask it in retrospect about 615KB of unused plugin data and the answer is obvious. In the moment, mid-sprint, with a feature to ship, you don’t ask. You add.
The sum of all those additions was a 5-to-10x cost multiplier no one was tracking. The CEO agent’s instructions ran ~7KB per run. After the audit: ~3KB. The CTO: ~10KB down to ~8KB. The Sr Engineer: ~3.7KB down to ~2KB. These numbers don’t look dramatic on a spreadsheet. Multiply them by hundreds of agent runs per day, add the plugin bloat, add the cron pollution, and you’re burning tens of thousands of tokens per hour on things that are invisible on any single invoice line.
I wasn’t running a wasteful stack. I was running a normal stack. That’s the point.
Who This Is For (And Who It Isn’t)
A note before we go further: cache discipline matters most if you’re running 4+ agents, 100+ daily runs, or spending $1K+/month on LLM inference. Below that threshold, the absolute dollar savings are small enough that the time cost of auditing probably exceeds the return. If you’re running two agents for a side project at $5/month, the audit takes 5 minutes and might find something — but the rest of the framework is premature for you. That’s okay. The patterns are still worth knowing for the moment you cross the threshold; just don’t feel obligated to restructure your stack today.
Cache Discipline: The New Prompt Engineering
Three years ago, “prompt engineering” was crystallizing as a practice. People had been writing prompts for longer than that, but the community was just beginning to name what worked: few-shot examples, chain-of-thought reasoning, role assignment, output formatting. The naming made it teachable. It created vocabulary. It turned intuitions into transferable skills.
Cache discipline is at that same stage right now.
Everyone running a non-trivial agent stack has some instinct about context size. They’ve noticed that some sessions feel faster. They’ve vaguely tried to keep system prompts lean. They’ve maybe heard that prompt caching exists. But it’s not a practice yet — it’s a feeling.
Here’s the definition I’m working with: cache discipline is the practice of treating every token entering context as a cost, every cache hit as savings, and every prompt design decision as a budget allocation.
The hardest part to internalize is the budget piece: cache keys are prefix-based, so the moment anything variable appears before your stable content, you’ve busted the cache for everything that follows. Prompt layout is cost architecture.
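To make the prefix point concrete, here is a small sketch that compares two prompt layouts. It assumes the provider reuses only an exact leading match, and it measures that match in characters; real caches work on tokens and often in fixed-size blocks, so treat the percentages as illustrative. The function name and the sample prompts are mine, not from any platform.

```python
# Sketch: how much of a prompt is reusable as a cached prefix?
# Assumption: reuse ends at the first character that differs between two
# consecutive prompts. Real caches operate on tokens, not characters.
import os.path


def shared_prefix_ratio(previous: str, current: str) -> float:
    """Fraction of `current` that matches the start of `previous`."""
    if not current:
        return 0.0
    common = os.path.commonprefix([previous, current])
    return len(common) / len(current)


# Hypothetical stable instructions, repeated to give them some bulk.
STABLE = "You are the ops-lead agent. Follow the runbook.\n" * 20

# Layout 1: variable data first, stable instructions after.
bad_run_1 = "Timestamp: 09:00\n" + STABLE
bad_run_2 = "Timestamp: 09:20\n" + STABLE

# Layout 2: stable instructions first, variable data last.
good_run_1 = STABLE + "Timestamp: 09:00\n"
good_run_2 = STABLE + "Timestamp: 09:20\n"

print(f"variable-first: {shared_prefix_ratio(bad_run_1, bad_run_2):.0%} reusable")
print(f"stable-first:   {shared_prefix_ratio(good_run_1, good_run_2):.0%} reusable")
```

Same content, same length, opposite cost profile. The only thing that changed is where the timestamp sits.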
This is not prompt engineering. Prompt engineering is about output quality: getting the model to say the right thing. Cache discipline is about cost: making sure you don’t pay for context that doesn’t earn its place.
This is not FinOps. FinOps looks at invoices. Cache discipline looks at the moment before the token is sent: the architectural choice, the cron script, the config file. It asks whether the cost is warranted.
The Token Telescope: 7 Principles
The Token Telescope is the framework I use to apply cache discipline in practice. Seven principles, in the order I’d apply them to a new stack. They aren’t all equal weight — Principle 2 is the lens through which the other six get applied.
1. Trunk & Fork
The trunk session stays lean. Its only job is coordination. All substantive work happens in disposable subagent forks that inherit the cached prefix and get discarded on completion. The trunk prefix stays hot, so every fork starts cheaper. In my stack, the ops-lead agent respawns every 8-10 cycles to prevent context accumulation — before that change, 20 cycles generated 5-6MB of accumulated context. After: capped and predictable.
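The respawn rule can be sketched in a few lines. This is a minimal model under my own assumptions, not how any orchestration platform implements sessions: `TrunkSession`, `record_cycle`, and `respawn` are hypothetical names, and context size is counted in characters for simplicity.

```python
# Sketch of the respawn rule: a trunk session that drops its accumulated
# history after a fixed number of cycles. All names are hypothetical; a real
# platform would expose its own session and subagent primitives.
from dataclasses import dataclass, field


@dataclass
class TrunkSession:
    stable_prefix: str        # identical across respawns, so it stays cacheable
    respawn_after: int = 8    # the stack described above uses 8-10 cycles
    cycles: int = 0
    history: list[str] = field(default_factory=list)

    def record_cycle(self, fork_summary: str) -> None:
        """Keep a short summary from each fork, never its full transcript."""
        self.history.append(fork_summary)
        self.cycles += 1
        if self.cycles >= self.respawn_after:
            self.respawn()

    def respawn(self) -> None:
        """Start fresh: same prefix, empty history."""
        self.history.clear()
        self.cycles = 0

    def context_size(self) -> int:
        """Characters currently carried by the trunk."""
        return len(self.stable_prefix) + sum(len(h) for h in self.history)
```

The design choice that matters is that the prefix survives the respawn untouched. Only the accumulated history is thrown away, which is what keeps the growth capped.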
2. The Token Telescope (the meta-principle)
The framework is named after this principle for a reason. It’s the lens; the other six are what you see through it. There is exactly one question every other principle is trying to help you answer:
Does this need to be in context right now?
Not eventually. Not in case. Not because it was useful last month or because removing it feels risky. Right now, for this run, for the work the agent is actually doing in the next 30 seconds. If the answer isn’t a clean yes, the thing belongs somewhere else: on disk, in a tool call, fetched on demand, or deleted.
The question is harder than it sounds because most context decisions are never decisions at all. You don’t sit down and choose to load 615KB of plugin data — it’s just there because you installed the plugin three weeks ago. You don’t choose to attach a personality document on every run — it became part of the agent’s standard payload when you wrote it and you’ve never re-evaluated. The Token Telescope is the habit of pulling each of those silent additions into the foreground and asking the question out loud.
Three examples from my own stack make the lens concrete:
- Clear no. The workflow automation plugin loaded 615KB every session. The agents never invoked it. The Token Telescope would have caught this at install time: does it need to be in context right now, doing useful work? No. Install on demand, or not at all.
- Clear yes-but-not-here. Cron output. Useful when the agent needs it, irrelevant the rest of the time. The instinct is to pipe it inline so it’s always there just in case. The Token Telescope reframes that — does this need to be in context right now? Only when the agent is making a decision that depends on it. The output goes to disk; the agent reads the file when it actually needs the data.
- Genuine edge case. Personality documents. Here the answer isn’t always no, and the Token Telescope doesn’t pretend it is. For a customer-facing agent holding a multi-turn conversation, voice consistency might be load-bearing — tone drift between turns is a real failure mode. For a stateless coordination agent that prints status updates, voice is decoration. Same question, different answers, and the question itself is what surfaces the difference. What the Token Telescope tells you to do here is run the test. Same prompt with the doc, same prompt without, ten times each. Look at the outputs. If you can’t tell the difference, the doc isn’t earning its tokens. Decide on evidence, not on the warm feeling that it ought to matter.
The other six principles are specialized applications of this one question. Stable Prefix asks it about ordering: does this variable thing need to appear before the stable content right now? Instruction Hygiene asks it about the prose you wrote months ago: does this paragraph need to be re-read on every run, or did you write it for yourself and not the model? File-Based State, Trunk & Fork, Event-Driven Over Polling — they’re all variations on the same lens applied to different parts of the stack.
This one question changes everything. Ask it about every addition to context — at install time, at design time, at code review time, at the moment you’re about to paste something into a system prompt. Most of the cost savings came from asking this question, honestly, about things I’d added without ever asking it the first time. Not heroic optimization — just attention.
3. Stable Prefix
Cache keys are prefix-based. If the first token of your prompt varies between runs, you have no cache. The fix: a visible --- CACHE BOUNDARY --- marker in your system prompts. Everything before the marker must be identical between sessions. Everything after can vary freely without touching the cached prefix. I use this in every agent instruction file now. It’s the first thing I check in any new stack.
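A boundary is only useful if you can check that the part before it really is identical between sessions. Here is one way to sketch that check. The marker string is the one above; the helper names are mine, and I'm assuming the marker is a convention for people and tooling, with the provider seeing the prompt as one plain string.

```python
# Sketch: assemble a prompt around an explicit boundary and fingerprint the
# stable part. Assumption: the marker is a human/tooling convention only.
import hashlib

CACHE_BOUNDARY = "--- CACHE BOUNDARY ---"


def build_prompt(stable: str, variable: str) -> str:
    """Stable content first, then the marker, then anything that changes."""
    return f"{stable}\n{CACHE_BOUNDARY}\n{variable}"


def prefix_fingerprint(prompt: str) -> str:
    """Hash of everything before the boundary; should match across sessions."""
    stable, marker, _ = prompt.partition(CACHE_BOUNDARY)
    if not marker:
        raise ValueError("prompt has no cache boundary marker")
    return hashlib.sha256(stable.encode("utf-8")).hexdigest()
```

Logging the fingerprint on every run gives you a cheap alarm: if it changes between two sessions and you didn't edit the instructions, something variable has crept above the line.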
4. File-Based State
Cron prompts don’t need to carry their output. They need to trigger a read. Script outputs go to disk; the agent reads the file on demand. Before this change, status checks dumped full output inline — ~200 tokens of variable context per run, 3x/hour. After: the cron prompt is static boilerplate, ~10 tokens. The state is on disk. The cache is intact.
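In code, the pattern looks roughly like this. It's a sketch under assumptions: the state file path and the prompt wording are hypothetical, and how the static prompt reaches the agent depends on your platform.

```python
# Sketch: the cron job writes its output to disk; the prompt it hands to the
# agent never changes. Path and prompt text are hypothetical.
import subprocess
from pathlib import Path

STATE_FILE = Path("state/status.txt")
STATIC_PROMPT = "Status check ran. Read state/status.txt if you need the details."


def run_status_check(command: list[str], state_file: Path = STATE_FILE) -> str:
    """Run the check, persist its output, and return the unchanging prompt."""
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    state_file.parent.mkdir(parents=True, exist_ok=True)
    state_file.write_text(result.stdout, encoding="utf-8")
    return STATIC_PROMPT
```

Whatever the check prints, the return value is byte-for-byte the same, so the scheduled prompt stops perturbing the cached prefix. The variable data still exists; it just waits on disk until the agent asks for it.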
5. Agent Instruction Hygiene
Kill SOUL.md. Kill HEARTBEAT.md if it duplicates content in your main instructions file. Merge or delete. Personality documents are token waste in most stacks. The model doesn’t need a prose description of its values re-read on every run. The CMO agent in my stack runs on 813 bytes. Single file. No fluff. That’s the target. My cuts: CEO instructions dropped from ~7KB to ~3KB. Engineer from ~3.7KB to ~2KB.
Validation note — quality impact. I removed personality docs from my stack and saw no quality degradation on spot checks. That’s specific to my use case: stateless agents doing short interactions, none of them customer-facing. Your stack might differ. Before deleting personality docs from yours, run three tests:
- Consistency. Same prompt 10x with the doc, 10x without. Compare output variance.
- Multi-turn. If your agents hold multi-turn conversations, check whether tone holds across exchanges.
- Reasoning-heavy. If your agents do planning or analysis, check whether decision quality degrades when you cut the doc.
Treat the test as the principle, not the deletion. Whether personality docs survive your stack depends on whether they actually change output, and you won’t know that without measuring.
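The consistency test is easy to put on rails. Below is a minimal harness, with two loud assumptions: `generate` is a placeholder for whatever function calls your model (it takes the personality doc and the task prompt and returns text), and `difflib` similarity is a crude proxy for "variance", not a quality metric.

```python
# Sketch of the consistency test: run the same prompt N times with and
# without the personality doc, then compare how much the outputs vary.
# `generate` is a placeholder; plug in your own model call.
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean
from typing import Callable


def mean_pairwise_similarity(outputs: list[str]) -> float:
    """Average similarity (0..1) over all pairs; higher means more consistent."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0
    return mean(SequenceMatcher(None, a, b).ratio() for a, b in pairs)


def consistency_test(
    generate: Callable[[str, str], str],
    personality_doc: str,
    task_prompt: str,
    runs: int = 10,
) -> dict[str, float]:
    """Return output consistency with the doc loaded and with it removed."""
    with_doc = [generate(personality_doc, task_prompt) for _ in range(runs)]
    without_doc = [generate("", task_prompt) for _ in range(runs)]
    return {
        "with_doc": mean_pairwise_similarity(with_doc),
        "without_doc": mean_pairwise_similarity(without_doc),
    }
```

If the two scores are close, that's a signal to read the outputs side by side before deciding anything. The numbers tell you whether the doc changes consistency; they don't tell you whether either set of outputs is good.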
6. Event-Driven Over Polling
A WebSocket listener is infrastructure. It costs zero tokens. A cron that polls for events three times an hour is an agent. It burns context and busts cache. Replacing polling crons with a listener didn’t just reduce token use; it eliminated an entire class of cache perturbation. Cron-based cache busting went from 3x/hour to 1x/hour, a two-thirds (~67%) reduction from consolidation alone, before the listener was fully operational. (Caveat: this principle assumes your platform supports event-driven hooks. Not all do; see the Validation Notes for what to do on platforms that don’t.)
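The core of the idea fits in one function: the listener fills a queue for free, and the agent only runs when the queue has something in it. This is a sketch with hypothetical names (`drain_and_dispatch`, `run_agent`); the listener itself is left out, and I'm assuming events arrive as plain dictionaries.

```python
# Sketch: invoke the agent only when events are actually pending.
# Assumption: a listener (not shown) puts event dicts on the queue.
import queue
from typing import Callable


def drain_and_dispatch(
    events: queue.Queue,
    run_agent: Callable[[list[dict]], None],
) -> int:
    """Hand all pending events to the agent in one batch; return how many."""
    batch: list[dict] = []
    while True:
        try:
            batch.append(events.get_nowait())
        except queue.Empty:
            break
    if not batch:
        return 0  # nothing happened, so no agent run and no tokens spent
    run_agent(batch)
    return len(batch)
```

Batching is a deliberate choice here: several events that land close together cost one agent run instead of several, which is the same consolidation win described above.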
7. The 5-Minute Audit
Five questions. Answer them about your stack:
- What is loaded in every agent’s system prompt? What of that is actually used on every run?
- Do your agents have personality or style documents? When did you last validate they change output quality?
- Are your cron prompts variable? Are they dumping state inline?
- Where is your cache boundary, and is it stable between sessions?
- What was in context on the last 10 runs that wasn’t in context on the first run you ever ran?
The fifth question is the one that finds the drift. Stacks accumulate. Files get added. References get duplicated. Nobody audits it because the cost is invisible — until you look.
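The fifth question can be answered mechanically if you can get a list of what was loaded on each run. Here is a sketch that assumes you have that as a "manifest": a mapping from item name to size in bytes. Capturing the manifest is platform-specific and not shown; the function name is mine.

```python
# Sketch for question five: which context items show up now that were not
# there on the first run? Each manifest maps an item name to its size in bytes.
# How you capture a manifest per run is an assumption left to your platform.


def context_drift(
    first_run: dict[str, int],
    recent_runs: list[dict[str, int]],
) -> dict[str, int]:
    """Items absent from the first run but present recently, with their max size."""
    drift: dict[str, int] = {}
    for manifest in recent_runs:
        for name, size in manifest.items():
            if name not in first_run:
                drift[name] = max(size, drift.get(name, 0))
    # Largest additions first, since they cost the most.
    return dict(sorted(drift.items(), key=lambda item: item[1], reverse=True))
```

This only catches new items. An item that existed on day one and quietly tripled in size needs a separate size comparison.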
The Numbers
Before and after, from the audit:
| Area | Before | After | Saving |
|---|---|---|---|
| Plugin bloat in system prompt | 615KB (one integration alone) + 3 unused plugins | Removed | ~20 system-reminder entries eliminated |
| Agent instructions per run | CEO: ~7KB, CTO: ~10KB, Engineer: ~3.7KB | CEO: ~3KB, CTO: ~8KB, Engineer: ~2KB | ~1.7-4KB per agent (~7.7KB combined) |
| Ops context per session | ~27KB | ~15KB | ~12KB per session |
| Cron cache perturbations | 3x/hour | 1x/hour | ~67% fewer cache busts |
| Agent personality bloat | 5.6KB (CEO), 1.8KB (CTO), 1.4KB (Engineer) | 1.2KB, 0KB, 0KB | 7.6KB total per agent run cycle |
| Ops-lead context accumulation | 5-6MB over 20 cycles | Respawn every 8-10 cycles | ~75% less accumulated context |
Projected: 60,000-180,000 tokens saved per day across 4 agents on Ollama Cloud inference. The range is wide because agent run frequency varies. The floor is real.
The 60% headline reduction comes from the combined effect: fewer cache misses compounding across every session start and agent run. No single change delivers 60%. The discipline delivers 60%.
How I measured this. I counted actual API tokens consumed by the stack, not file sizes extrapolated to token counts. The 4-agent stack ran for several days pre-optimization, then several days post-optimization. Per-session and per-run token totals came from the inference layer. The ~60% figure is the measured delta in aggregate token consumption across comparable workloads — read it as directional. The precise figure shifts with workload composition and the measurement window. The projected daily range (60K–180K) extrapolates further across run frequency. My agents ran between ~150 and ~450 runs/day during the observation window; that’s where the spread comes from. The percentage is the measured part. The daily token number scales with whatever your workload looks like.
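The relationship between the run range and the token range is simple enough to check by hand. The sketch below back-calculates the per-run saving implied by the two ranges. That per-run figure is my derivation from the numbers above, not a separately reported measurement, and the function name is hypothetical.

```python
# Back-calculation from the ranges quoted above; not an independent measurement.
LOW_RUNS, HIGH_RUNS = 150, 450             # runs/day during the observation window
LOW_SAVED, HIGH_SAVED = 60_000, 180_000    # projected tokens saved per day

# Both ends of the range imply the same saving per run.
implied_per_run_low = LOW_SAVED / LOW_RUNS
implied_per_run_high = HIGH_SAVED / HIGH_RUNS


def projected_daily_savings(
    runs_per_day: int,
    saved_per_run: float = implied_per_run_low,
) -> float:
    """Scale the implied per-run saving to a different run frequency."""
    return runs_per_day * saved_per_run


print(implied_per_run_low, implied_per_run_high)  # both 400.0
```

So the spread in the daily figure comes entirely from run frequency. To project your own number, swap in your run count, and ideally your own measured per-run saving rather than mine.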
Source: a 4-agent stack on a popular multi-agent orchestration platform. One week of focused audit and restructuring. No changes to what the agents could do — only to what was loaded into their context and when.
Try the Audit Yourself
You don’t need a full week. Start with 5 minutes.
Step 1. Open the system prompt for your highest-traffic agent. Read it like you’re paying per word — because you are. Mark anything that doesn’t need to be there on every run.
Step 2. List every plugin, integration, or tool included in your platform’s system reminders. When did you last use each one?
Step 3. Find your cron jobs or scheduled prompts. Are they passing state inline? Are they variable? What does that variability cost your cache prefix?
Step 4. Look for your stable-vs-variable boundary. Is it explicit? Does anything variable appear before anything stable?
Step 5. Search your agent instructions for personality, tone, values, or style documents. Weigh each one. If you removed it, would output quality change? Run the experiment.
If you found something in step 1, you have a leak. If you found something in every step, you have a system. Most stacks do.
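If your agent instructions and attachments live in a directory, a short script can do the first pass of steps 1 and 5 for you. This is a sketch with several assumptions: the directory layout is hypothetical, the 4-bytes-per-token ratio is a rough rule of thumb rather than a tokenizer, and the filename hints are a crude keyword match that will produce false positives.

```python
# Sketch: list every file under a context directory, largest first, and flag
# likely personality/style documents by filename.
# Assumptions: files-on-disk layout; BYTES_PER_TOKEN is a rough estimate only.
from pathlib import Path

BYTES_PER_TOKEN = 4
STYLE_HINTS = ("soul", "heartbeat", "personality", "tone", "values", "style")


def inventory(context_dir: Path) -> list[dict]:
    """Return one row per file with size, estimated tokens, and a style flag."""
    rows = []
    for path in context_dir.rglob("*"):
        if not path.is_file():
            continue
        size = path.stat().st_size
        rows.append({
            "file": str(path.relative_to(context_dir)),
            "bytes": size,
            "est_tokens": size // BYTES_PER_TOKEN,
            "style_doc": any(hint in path.name.lower() for hint in STYLE_HINTS),
        })
    return sorted(rows, key=lambda row: row["bytes"], reverse=True)
```

Read the output from the top. The biggest files are where the question "does this need to be here on every run?" pays off fastest.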
Where This Leaves You
Cache discipline is a question you learn to ask: does this need to be in context right now? Nothing to install, nothing to configure — just the question, applied early and often. Your stack leaks because, like mine did, it’s the output of reasonable decisions made without a cost lens. Once you have the lens, the leaks stop being invisible. You start to see them at install time, in the cron script you wrote six months ago and haven’t touched since, in the personality doc you added when you thought it mattered. The discipline is the seeing.
If you want to run the 5-Minute Audit this week with templates and worksheets to do it cleanly, I’m packaging the Token Telescope Audit Kit: the full framework, the expanded principles, a self-audit worksheet, before/after templates, and the decision trees I use when I’m uncertain whether something earns its tokens. Sign up at tokentelescope.com to get it when it lands, plus the weekly writing on what’s actually working in multi-agent cost optimization. No noise; just the patterns as I find them.
The leaks were everywhere in my stack. They’re probably everywhere in yours.
Go look.
Validation Notes: What I’m Still Testing
This essay describes a working framework applied to one stack. Several claims would be stronger with more rigorous evidence than I have today. Rather than hedge them inside the prose or pretend they’re settled, I’m flagging them openly. If you’re considering applying the framework, these are the places to test before generalizing.
1. Personality prompts having “zero functional value”
Claim: Removing SOUL.md / HEARTBEAT.md and other personality docs produced no measurable quality drop.
Current evidence: Spot checks of outputs before and after removal across a few dozen runs. No obvious differences in tone, correctness, or coherence on the agents in my stack (largely stateless, not customer-facing).
What would strengthen this:
- A/B with the same prompt run 10x with the personality doc and 10x without, with output variance measured rather than eyeballed.
- Multi-turn conversations, where personality plausibly stabilizes tone across exchanges.
- Reasoning-heavy agents (planning, analysis, debugging) where context can condition reasoning quality without changing surface output.
Status: Partially validated for my use case (stateless coordination agents). Unvalidated as a general claim. If your agents are customer-facing or reasoning-heavy, treat my “zero value” finding as a hypothesis to test, not a result to copy.
2. The 60% measurement methodology
Claim: Cache discipline cut my stack’s context costs by ~60% in a week.
Current evidence: Per-session and per-run token counts from the inference layer, compared pre- and post-optimization across multi-day observation windows on comparable workloads.
What would strengthen this:
- Holding workload composition constant across the pre/post windows (my agents weren’t doing identical tasks across the two periods).
- Controlling for day-of-week and traffic variation in run frequency.
- Cross-platform replication on a different inference provider with different caching behaviour.
Status: Defensible at the order-of-magnitude level (the savings are clearly large). The precise “60%” is directional — the exact figure shifts with workload composition and the measurement window. Treat it as a representative reduction on the audited stack, not a guaranteed delta on any stack.
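One cheap step toward the run-frequency control listed above is to compare tokens per run rather than window totals. The sketch below does that. The function name is mine, the inputs are whatever per-run token counts your inference layer reports, and normalizing per run does nothing about differences in workload composition.

```python
# Sketch: percentage reduction in mean tokens per run between two windows.
# Normalizing per run removes run-frequency differences between the windows;
# it does not control for what the agents were asked to do.
from statistics import mean


def percent_reduction(pre_tokens_per_run: list[int], post_tokens_per_run: list[int]) -> float:
    """Reduction in mean tokens per run, as a percentage of the pre-window mean."""
    pre = mean(pre_tokens_per_run)
    post = mean(post_tokens_per_run)
    if pre == 0:
        raise ValueError("pre-optimization window has zero tokens per run")
    return (pre - post) / pre * 100
```

Running it separately per agent, or per weekday, is a straightforward extension if the aggregate number looks suspicious.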
3. Generalizing 4 agents to “your stack”
Claim: The principles transfer to other stacks; the audit findings are representative, not idiosyncratic.
Current evidence: The waste categories (unused plugins, personality bloat, inline cron output, prefix-busting variability) are mechanical consequences of how prompt caching works and how teams accumulate context. I’d expect them in most stacks.
What would strengthen this:
- The framework applied on a substantially larger stack (50+ agents or 10K+ runs/day), where orchestration patterns change non-linearly (batching, pooling, retries, queueing).
- Replication on different orchestration platforms with materially different caching behavior.
- Documentation of which principles are universal vs. scale-dependent.
Status: The diagnostic questions in the 5-Minute Audit are platform-portable. The specific architectural fixes (Trunk & Fork, event-driven over polling) may need adaptation at significantly larger scale. If you’re running at much larger scale than four agents, I’d love to hear what generalizes and what doesn’t.
4. Event-Driven Over Polling infrastructure assumption
Claim: Replacing polling crons with event-driven listeners eliminates a class of cache perturbation and is broadly applicable.
Current evidence: It worked in my stack on a platform that supports webhook-style triggers. The token savings and cache-perturbation reduction are real.
What would strengthen this:
- Testing on platforms without native event-driven support (some Ollama setups, some Claude Code configurations).
- Quantifying the trade-off — listeners are “free” in tokens but not free in infrastructure (latency, complexity, uptime).
- A scale threshold below which polling is simpler and cheaper than standing up listener infrastructure.
Status: Real but platform-conditional. If your platform doesn’t expose webhook/listener primitives, the closest portable substitute is “consolidate polling: fewer crons, longer intervals, smaller payloads.” That captures most of the cache-perturbation win without requiring infrastructure you don’t have.
Armin — builder at kern.web.za and mftplus.co.za. Writing about what actually works in multi-agent systems.