April 18, 2026 · 10 min read

Authentic Methods to Save Tokens on Claude: A Developer's Guide

If your Claude API bill looks heavier than your actual output warrants, you're not alone — and you're almost certainly overspending. Most developers don't burn tokens because Claude is expensive; they burn tokens because the defaults are optimized for convenience, not cost. Every unoptimized system prompt, every uncached document, every Opus call that Haiku could have handled — it all compounds. This guide is a no-fluff walkthrough of techniques that genuinely move the needle. No vague "write better prompts" advice. Every method below is either backed by Anthropic's official pricing and documentation or by verifiable patterns in production deployments.

Topics: Prompt caching, Batch API, Model routing, Token-efficient tool use, System prompt trimming, Claude Code context management


1. Prompt Caching: the single highest-leverage optimization

If you do nothing else on this list, do this. Prompt caching is the closest thing Anthropic offers to free money for developers with repetitive context.

How the economics actually work. Cache reads cost 10% of the base input token price — a 90% discount. Cache writes cost 1.25× the base price for a 5-minute TTL, or 2× for a 1-hour TTL. That means a 5-minute cache pays for itself after a single cache hit, and a 1-hour cache breaks even after two hits. Everything after that is pure savings.
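The break-even claims can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the multipliers quoted above and assumes every hit lands before the cache entry expires; the function names are illustrative, not part of any SDK.

```python
# Costs are in units of "one uncached send of the prefix at the base input price".
CACHE_READ = 0.10   # cache reads cost 10% of base
WRITE_5M = 1.25     # write multiplier for the 5-minute TTL
WRITE_1H = 2.00     # write multiplier for the 1-hour TTL


def cached_cost(requests: int, write_multiplier: float) -> float:
    """One cache write followed by (requests - 1) cache reads."""
    if requests < 1:
        return 0.0
    return write_multiplier + (requests - 1) * CACHE_READ


def uncached_cost(requests: int) -> float:
    """Every request pays the full base price for the prefix."""
    return float(requests)


def break_even_hits(write_multiplier: float) -> int:
    """Smallest number of cache hits at which caching is strictly cheaper."""
    hits = 0
    while cached_cost(hits + 1, write_multiplier) >= uncached_cost(hits + 1):
        hits += 1
    return hits
```

With these numbers, `break_even_hits(WRITE_5M)` is 1 and `break_even_hits(WRITE_1H)` is 2, matching the figures in the paragraph above.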

What to cache. Anything static and long-lived across requests:

  • System prompts (especially long ones with instructions, personas, format rules)
  • Tool definitions for agent workflows
  • RAG documents and knowledge bases injected into context
  • Few-shot examples
  • Large codebases or specs you're asking questions about

How to implement it. Mark the cacheable block with cache_control:

import anthropic

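client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder inputs for illustration; substitute your own content.
LARGE_DOCUMENT = "..."   # your long, static document
user_question = "..."    # the per-request question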

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a technical documentation assistant..."
        },
        {
            "type": "text",
            "text": LARGE_DOCUMENT,  # e.g. 50-page spec
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": user_question}]
)

# Verify the cache worked
print(response.usage.cache_creation_input_tokens)  # first call
print(response.usage.cache_read_input_tokens)      # subsequent calls

Gotchas that will silently break your cache:

  • Minimum prefix length. The cached prefix must meet a model-specific minimum: 1,024 tokens for many Sonnet and Opus models, and more for some others, so check the documentation for the model you use. Shorter prefixes are not cached, and nothing warns you: the request still succeeds, but cache_creation_input_tokens and cache_read_input_tokens both stay at 0. Always verify with the usage fields on the response.
  • Byte-for-byte identity. A single changed character in the cached prefix (a timestamp, a reordered key, a user-specific variable) forces a fresh write. Some languages, Swift and Go among them, randomize key order during JSON conversion, which can reorder the keys in tool_use blocks and break the cache without warning.
  • Order matters. Caching hashes the prefix up to and including the cache_control block, in the order tools → system → messages. Put dynamic content after your cached content, never before.
  • Invalidators. Changing tool_choice or adding/removing images anywhere in the prompt invalidates cached message content; cached tool definitions and system prompts are unaffected by those two changes.
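One way to catch accidental prefix changes before they cost money is to fingerprint the content you intend to cache and log it with each request. The sketch below is a local diagnostic only: `stable_json` and `prefix_fingerprint` are hypothetical helpers, and the hash is not the one the API computes internally.

```python
import hashlib
import json


def stable_json(value) -> str:
    """Serialize with sorted keys and fixed separators so equal data gives equal bytes."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def prefix_fingerprint(tools: list, system: list) -> str:
    """Hash the would-be cached prefix in the order tools, then system."""
    payload = stable_json(tools) + "\n" + stable_json(system)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

If the fingerprint differs between two requests that were supposed to share a cache, something dynamic (a timestamp, a per-user value, an unstable key order) has leaked into the prefix.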

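For RAG and agent applications that hit the same context thousands of times a day, caching can cut the cost of the repeated input by close to 90%, since cache reads are billed at 10% of the base price. The saving on the total bill is smaller, because output tokens and the uncached part of each request are still billed at the normal rate.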


2. Use the Batch API for anything that isn't real-time

The Message Batches API processes requests asynchronously within 24 hours and delivers a flat 50% discount on both input and output tokens. No quality trade-off, no special prompting — it's the same model producing the same outputs.

Use it for:

  • Nightly data enrichment or document processing pipelines
  • Bulk content generation
  • Model evaluation runs and regression tests
  • Back-office summarization, tagging, classification

Don't use it for: interactive user-facing requests where latency matters.

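The discount stacks with prompt caching. A batched call that hits a cache pays roughly 5% of the base input price for the cached tokens (50% of the 10% cache-read rate), a 95% reduction on that input. Because batch requests are processed concurrently, cache hits within a batch are best-effort rather than guaranteed. For any workload that tolerates hours of latency, this is the lowest-effort win on this list after enabling caching.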
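As a rough illustration of moving work to batches, the sketch below builds the request list offline and submits it in a separate step. The `build_batch_requests` and `submit_batch` helpers and the summarization prompt are assumptions made for this example; the `custom_id` and `params` entry shape and the `client.messages.batches.create` call follow the Python SDK.

```python
def build_batch_requests(documents: dict[str, str], model: str, max_tokens: int = 256) -> list[dict]:
    """Turn {custom_id: text} into Message Batches request entries."""
    return [
        {
            "custom_id": custom_id,
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [
                    {"role": "user", "content": f"Summarize in two sentences:\n\n{text}"}
                ],
            },
        }
        for custom_id, text in documents.items()
    ]


def submit_batch(requests: list[dict]):
    """Submit the batch; needs the anthropic package and ANTHROPIC_API_KEY."""
    import anthropic  # imported lazily so the builder can be tested offline

    client = anthropic.Anthropic()
    return client.messages.batches.create(requests=requests)
```

After submitting, poll the batch until its processing_status is "ended", then read the results and match them back to your inputs by custom_id.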


3. Stop using Opus for tasks Haiku can do

Model selection is where most teams leave the most money on the table, because the defaults nudge you toward the most capable model rather than the most appropriate one.

Current pricing per million tokens (as of April 2026):

Model        Input   Output   Best for
Haiku 4.5    $1      $5       Classification, extraction, boilerplate, routing, simple Q&A
Sonnet 4.6   $3      $15      Everyday coding, analysis, most agent work
Opus 4.7     $15     $75      Deep reasoning, architecture decisions, complex refactors

At these list prices, Opus costs 15× as much as Haiku on both input and output. If 80% of your calls are classification, extraction, or simple transformations, routing those to Haiku while reserving Sonnet or Opus for genuinely complex reasoning will often cut bills by more than half before any other optimization.

Practical routing pattern: implement a cheap classifier (Haiku itself, or a regex) that decides which model handles the request. For agent systems, a common pattern is Opus for planning, Sonnet or Haiku for execution. In Claude Code specifically, the opusplan mode does this automatically — Opus during plan mode, then Sonnet for implementation.
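A regex router of the kind described above can be very small. The sketch below is one possible shape: the keyword lists, the 2,000-character cutoff, and the model IDs are all placeholders to tune against your own traffic, not recommendations from Anthropic.

```python
import re

# Model IDs are placeholders; substitute the IDs your account actually uses.
MODELS = {
    "simple": "claude-haiku-4-5",
    "standard": "claude-sonnet-4-5",
    "complex": "your-opus-model-id",
}

SIMPLE = re.compile(r"\b(classify|extract|tag|label|translate|reformat)\b", re.IGNORECASE)
COMPLEX = re.compile(r"\b(architecture|design|refactor|trade-?offs?|prove)\b", re.IGNORECASE)


def pick_model(task: str) -> str:
    """Route by keyword: complex cues win, then simple cues, else the middle tier."""
    if COMPLEX.search(task):
        return MODELS["complex"]
    if SIMPLE.search(task) and len(task) < 2000:
        return MODELS["simple"]
    return MODELS["standard"]
```

Defaulting to the middle tier keeps misrouted requests from landing on either the cheapest or the most expensive model; log the routing decision so you can audit it later.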


4. Enable token-efficient tool use for agents

If you're building agents with tool use, check whether your model already makes token-efficient tool calls. The feature was introduced as a beta for Claude Sonnet 3.7, enabled via:

anthropic-beta: token-efficient-tools-2025-02-19

Claude 4 and later models use token-efficient tool calls by default, so the header adds nothing there. Where the header does apply, the tool definition schema and response format don't change. Anthropic's documentation for the beta reported output-token savings of about 14% on average and up to 70%; larger reductions quoted for agent applications reflect this combined with caching.

For agents with large tool libraries, the Tool Search Tool is also worth investigating — it loads tool definitions on demand rather than forcing every tool's schema into every request's context.


5. Trim your system prompts mercilessly

Every token in your system prompt is paid on every turn of every conversation. A bloated 5,000-token system prompt running across 10,000 daily requests is 50 million input tokens a day — for content that doesn't change.
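To put a dollar figure on that, here is a small sketch of the arithmetic. It assumes the $3 per million input price from the Sonnet row in the pricing section, no caching, and a hypothetical trimmed prompt of 1,500 tokens.

```python
def daily_prompt_cost(prompt_tokens: int, requests_per_day: int, usd_per_million: float) -> float:
    """Dollars per day spent re-sending a static system prompt at full input price."""
    return prompt_tokens * requests_per_day / 1_000_000 * usd_per_million


before = daily_prompt_cost(5_000, 10_000, 3.0)  # 50M tokens/day
after = daily_prompt_cost(1_500, 10_000, 3.0)   # same traffic, trimmed prompt (assumed size)
```

Under these assumptions the prompt alone costs $150 a day before trimming and $45 after.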

What actually helps:

  • Delete examples that don't change output quality. Most few-shot examples are there because someone added them during debugging and never removed them. A/B test removing each one.
  • Replace prose instructions with structured formats. Numbered steps and clear headers reduce ambiguity and often produce shorter outputs — saving on both sides.
  • Cut the politeness and framing. "You are a helpful, harmless assistant that carefully considers..." contributes nothing the model doesn't already know. Lead with the actual task spec.
  • Avoid "don't do X" lists. Positive instructions are shorter and more reliable than negative ones.
  • Use the stop_sequences parameter to cap verbose outputs at a natural boundary instead of letting the model ramble to max_tokens.

Related: cap max_tokens to what you actually need. Setting max_tokens: 4096 when you expect a 200-token answer costs nothing extra per call (output is billed on actual tokens), but it does count against your output tokens-per-minute rate limit, which is estimated from max_tokens when the request starts, and it raises the worst-case cost of a response that runs away.


6. For Claude Code specifically: manage context like a resource

Claude Code sessions burn through tokens faster than expected because developers let context accumulate indefinitely. The underlying API is stateless, so every turn re-sends the entire conversation: a 30,000-token history is transmitted again with every single message.

High-impact habits:

  • Use /clear when switching tasks. If you're moving from a database migration to a frontend bug, clear the context. The old conversation is dead weight.
  • Use /compact before sessions get long. It summarizes the conversation into a compressed version while keeping the key decisions. Run it proactively rather than when you're already struggling.
  • One task per chat. The temptation to "just keep going" is what turns a $0.50 session into a $15 one.
  • Keep CLAUDE.md under 500 tokens. Every token in that file is loaded before Claude reads anything else — before your code, before your question. Composable CLAUDE.md files (global, project, subdirectory) let you keep rules close to where they apply without bloating any single file.
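  • Exclude noisy paths. Stop Claude from reading node_modules, build artifacts, lockfiles, and logs into context. The documented mechanism in Claude Code is a permissions.deny list of Read(...) rules in .claude/settings.json, for example Read(./node_modules/**); a .claudeignore file is not part of the documented settings, so don't rely on one without checking that it is honored. You wouldn't believe how many tokens a casual "look at my project" request wastes on package-lock.json.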
  • Point at specific files, not directories. "Look at src/auth/middleware.ts" is wildly cheaper than "look at the auth module." The model wastes tokens exploring to find what you already know the location of.

An environment variable also reduces background token consumption:

# Disable non-critical background model calls (suggestions, tips)
export DISABLE_NON_ESSENTIAL_MODEL_CALLS=1

7. Start a manual summary pattern for long agent workflows

For conversational agents or long-running Claude Code sessions where /compact isn't enough, implement a rolling summary pattern:

  1. Every N turns (or when context exceeds a threshold), send the conversation to Claude with a prompt like "Summarize the key decisions, current state, and next step in under 300 tokens."
  2. Start a fresh session with that summary as the opening context.
  3. You've just compressed 30,000 tokens of history into ~300 tokens while preserving the information the next turn actually needs.
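The steps above can be sketched as a small helper. This version takes the summarizer as a parameter so it can be tested without network access; in practice `summarize` would send the history plus the summary prompt to Claude. The 4-characters-per-token estimate and the function names are assumptions for illustration.

```python
from typing import Callable

Message = dict  # {"role": "user" | "assistant", "content": str}

SUMMARY_PROMPT = (
    "Summarize the key decisions, current state, and next step in under 300 tokens."
)


def estimate_tokens(messages: list[Message]) -> int:
    """Very rough estimate (about 4 characters per token); an assumption, not a tokenizer."""
    return sum(len(m["content"]) for m in messages) // 4


def maybe_compact(
    history: list[Message],
    summarize: Callable[[list[Message], str], str],
    threshold_tokens: int = 30_000,
) -> list[Message]:
    """Return history unchanged below the threshold, else a fresh one-message history."""
    if estimate_tokens(history) < threshold_tokens:
        return history
    summary = summarize(history, SUMMARY_PROMPT)
    return [{"role": "user", "content": f"Summary of the session so far:\n{summary}"}]
```

Call `maybe_compact` before each turn and continue with whatever it returns; below the threshold it is a no-op, so it is safe to run unconditionally.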

For coding workflows specifically, a good handoff summary looks like:

Project: E-commerce checkout API
Current state: Payment processing built, Stripe integration working
Next step: Add error handling for declined cards
Key decisions: Express.js, PostgreSQL, Stripe SDK v14
File: src/payments/processor.ts

8. Use subagents for context-heavy exploration

When you need Claude to investigate something large — search a codebase, analyze logs, grep through documentation — spawn a subagent rather than loading that content into your main session. The subagent does the heavy reading, accumulates the context, and reports back with a short summary. Your main session stays lean and focused on direction and review.

This pattern is especially powerful in Claude Code, where you can literally say "use a subagent to investigate X and report back." The main context only ever sees the conclusion, not the investigation.


9. Measure before you optimize

None of this matters if you can't tell whether it worked. Before changing anything:

  • Pick a representative task — something you do often.
  • Note your current token count and cost. In Claude Code, use /cost at session end for API billing, or check console.anthropic.com → Usage for cross-session data. On a subscription, check claude.ai → Settings → Usage.
  • Apply one change. Run the same task again. Compare.

A few specific things to track:

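  • cache_read_input_tokens vs cache_creation_input_tokens on every response. Your cache hit rate tells you whether caching is actually doing anything. If cache_creation_input_tokens is non-zero when you expected a hit, something invalidated the cache and forced a fresh write. If both are zero, nothing was cached at all, usually because the prefix fell below the minimum length or the cache_control marker is missing.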
  • Tokens per task, not tokens per day — averages hide the outliers that actually drive your bill.
  • Cost per user session if you're running a product — this is the number that determines whether your unit economics work.
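For the cache metric in particular, a hit rate can be computed from the usage fields you already receive. This is a minimal sketch that assumes each usage record has been converted to a plain dict (for example via the SDK object's model_dump()) and that input_tokens counts only the uncached portion of the request.

```python
def cache_hit_rate(usages: list[dict]) -> float:
    """Fraction of all input tokens that were served from cache."""
    read = sum(u.get("cache_read_input_tokens") or 0 for u in usages)
    written = sum(u.get("cache_creation_input_tokens") or 0 for u in usages)
    uncached = sum(u.get("input_tokens") or 0 for u in usages)
    total = read + written + uncached
    return read / total if total else 0.0
```

Track this per task type rather than globally: a healthy overall rate can hide one endpoint whose cache never hits.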

Without measurement, you're guessing. And every change you can't measure is a change you can't defend.


Quick-reference priority list

If you're looking at your bill right now and want to know what to do in what order:

  1. Enable prompt caching on anything static over 1,024 tokens. This is almost always the biggest single win.
  2. Move async work to the Batch API. Free 50% discount with zero quality cost.
  3. Route by model. Haiku for simple, Sonnet for most, Opus only when warranted.
  4. Add the token-efficient tools header if you're building agents on a model that still needs it.
  5. Trim system prompts and CLAUDE.md files. Every persistent token is paid forever.
  6. Adopt session hygiene: /clear, /compact, one task per chat.
  7. Measure continuously. No optimization survives without observability.

The compounding effect of these is the point. Caching alone might save you 70%. Caching plus batching plus model routing plus prompt hygiene stacks into territory where a $2,000/month bill becomes a $300/month bill — on the same workload, with the same quality.

The honest trade-off across all of this: every optimization is an investment of developer time. For a solo developer making a few API calls a day, don't bother with cache instrumentation and Redis layers — the subscription plans offer predictable pricing that's simpler to manage. These techniques start paying for themselves when you're running at scale, or when the API bill is a visible line item someone asks about at the end of the month.
