How Claude API pricing works
Claude API costs scale with the number of tokens you send (input) and the number of tokens you receive (output). Pricing differs by:
- Model tier — Haiku (lowest cost), Sonnet (mid), Opus (highest cost). Higher tiers have stronger capability and higher per-token price.
- Token type — input tokens are priced lower than output tokens, typically by a multiple of several. The asymmetry matters because output tokens can dominate cost on generation-heavy workloads.
- Caching — cached input tokens are billed at a small fraction of the uncached input rate (cache writes carry a modest premium), which makes prefix-heavy workloads dramatically cheaper.
- Distribution channel — the direct Anthropic API has list pricing; AWS Bedrock, GCP Vertex AI, and Microsoft Azure may carry hyperscaler markup but consolidate billing.
- Volume commitments — enterprise contracts include committed-use discounts on volume.
Current pricing is published at anthropic.com/pricing; the specific numbers move over time. NINtec's Discovery phase produces a workload-specific cost projection.
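The cost model itself is simple enough to write down. A minimal sketch in Python, using made-up illustrative rates rather than the current list prices at the URL above:

```python
# Per-request cost model. The rates below are illustrative placeholders,
# not current list prices; always check anthropic.com/pricing.
INPUT_RATE_PER_MTOK = 3.00    # assumed $/million input tokens
OUTPUT_RATE_PER_MTOK = 15.00  # assumed $/million output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single uncached request."""
    return (input_tokens * INPUT_RATE_PER_MTOK
            + output_tokens * OUTPUT_RATE_PER_MTOK) / 1_000_000

# A 2,000-token prompt with a 200-token reply:
print(f"${request_cost(2_000, 200):.4f}")  # $0.0090 at the assumed rates
```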
What a token actually costs you
Approximate framing for practical reasoning:
- Haiku: cents per thousand output tokens — suitable for high-volume routing and classification
- Sonnet: a few cents per thousand output tokens — workhorse for most enterprise workloads
- Opus: roughly ten cents per thousand output tokens — reasoning-heavy or high-stakes workloads
The English-language token-to-word ratio is roughly 1.3 tokens per word. So 1000 output tokens is roughly a 750-word response.
For a customer-service deflection workload averaging 200-token responses at a scale of 10K conversations/day on Sonnet, the API cost lands in the low hundreds of dollars per day — small relative to the cost of the engineering built around it.
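A back-of-envelope version of that projection, with the per-conversation input volume and the rates both explicit assumptions:

```python
# Daily cost projection for the deflection workload above. Every figure
# is an assumption for illustration, not a quote.
CONVERSATIONS_PER_DAY = 10_000
INPUT_TOKENS_PER_CONV = 2_000     # assumed: system prompt + history + query
OUTPUT_TOKENS_PER_CONV = 200      # per the example above

INPUT_RATE_PER_MTOK = 3.00        # assumed Sonnet-class $/M input tokens
OUTPUT_RATE_PER_MTOK = 15.00      # assumed Sonnet-class $/M output tokens

daily_input = CONVERSATIONS_PER_DAY * INPUT_TOKENS_PER_CONV    # 20M tokens
daily_output = CONVERSATIONS_PER_DAY * OUTPUT_TOKENS_PER_CONV  # 2M tokens

daily_cost = (daily_input * INPUT_RATE_PER_MTOK
              + daily_output * OUTPUT_RATE_PER_MTOK) / 1_000_000
print(f"${daily_cost:,.0f}/day")  # $90/day here; heavier prompts or retries
                                  # push this into the low hundreds
```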
Prompt caching dramatically changes economics
Anthropic's prompt caching is operationally important. The mechanism: when a prompt has a long stable prefix (a system prompt, retrieval context, document set), the prefix can be cached. Subsequent requests reusing the cached prefix pay a much lower per-token rate for the cached portion.
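A sketch of the mechanism using the Anthropic Python SDK: the stable prefix is marked cacheable with a `cache_control` block, and the response's usage object reports cached versus uncached input tokens. The model id and prompt text here are placeholders:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

STABLE_PREFIX = "...role, guardrails, and reference documents..."  # identical across requests

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder: use a current model id
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": STABLE_PREFIX,
            "cache_control": {"type": "ephemeral"},  # cache everything up to this block
        }
    ],
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)

usage = response.usage
print(usage.cache_creation_input_tokens)  # tokens written to the cache (first request)
print(usage.cache_read_input_tokens)      # tokens served from the cache (later requests)
print(usage.input_tokens, usage.output_tokens)
```

The design consequence: keep stable material first and variable material last, because caching applies to a prefix.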
Workload shapes that benefit most:
- RAG systems with stable retrieval corpora — cache the corpus, pay full price only on the per-query variable portion
- Long system prompts with consistent role/guardrail instructions — cache the system prompt across all user messages
- Multi-turn conversations with stable context — cache the early turns
- Customer-service deflection with a large knowledge base — cache the knowledge base
For cache-friendly workloads, prompt caching typically reduces cost by 30–70%, depending on how much of each prompt is stable prefix. Architecture decisions that maximise cache hit rate compound into substantial cost savings at scale.
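The blended saving can be estimated directly. A sketch that assumes cache reads bill at roughly a tenth of the base input rate and ignores the cache-write premium; treat both simplifications as assumptions to verify against current pricing:

```python
# Blended input-token price as a multiple of the base input rate.
# The 0.1x read multiplier is an assumption; verify against current pricing.
CACHE_READ_MULTIPLIER = 0.1

def effective_input_multiplier(cached_fraction: float, hit_rate: float) -> float:
    """cached_fraction: share of each prompt covered by the stable prefix.
    hit_rate: fraction of requests that find that prefix already cached."""
    cached = cached_fraction * hit_rate * CACHE_READ_MULTIPLIER
    uncached = cached_fraction * (1 - hit_rate) + (1 - cached_fraction)
    return cached + uncached

# An 80%-stable prompt with a 95% hit rate pays ~32% of the base input price:
print(f"{effective_input_multiplier(0.8, 0.95):.2f}x")
```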
Provisioned throughput for predictable workloads
Provisioned throughput (PTU) is committed-capacity pricing — you pay for a fixed throughput allocation rather than per token. It is the right choice for:
- Steady-state high-volume workloads where capacity needs are predictable
- Workloads with strict latency SLAs that benefit from reserved capacity
- Cost-predictability requirements that favour fixed monthly spend over variable per-token billing
- Peak-period workloads (retail seasonal, fintech month-end) where burst capacity matters
PTU is wrong for sporadic or low-volume workloads — fixed cost without enough utilisation is wasted spend. NINtec's Discovery includes a PTU recommendation based on your workload's actual shape.
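The PTU decision reduces to a break-even comparison, as in the sketch below. Every figure is hypothetical, since committed-capacity pricing is contract-specific:

```python
# Break-even utilisation: committed capacity vs. on-demand per-token billing.
# Every figure is hypothetical; real PTU pricing is contract-specific.
PTU_MONTHLY_COST = 30_000.0           # assumed $/month for the commitment
PTU_MONTHLY_TOKEN_CAPACITY = 20e9     # assumed tokens/month it covers
ON_DEMAND_RATE = 3.00 / 1_000_000     # assumed blended $/token on demand

def break_even_utilisation() -> float:
    """Fraction of committed capacity you must actually consume before
    the commitment beats paying per token."""
    on_demand_at_full_capacity = PTU_MONTHLY_TOKEN_CAPACITY * ON_DEMAND_RATE
    return PTU_MONTHLY_COST / on_demand_at_full_capacity

print(f"break-even at {break_even_utilisation():.0%} utilisation")  # 50% here
```

Below the break-even utilisation, per-token billing wins; the further above it a workload sits, the stronger the PTU case.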
Cost telemetry as engineering discipline
We instrument every production Claude deployment with cost telemetry from day one:
- Per-tenant cost dashboards (essential for multi-tenant SaaS)
- Per-feature cost attribution (which features drive cost)
- Per-prompt cost tracking (which prompts are expensive)
- Cost-per-outcome metrics (cost per successful customer deflection, per resolved exception)
- Anomaly alerts on unexpected cost spikes
Without cost telemetry, the engineering team cannot make informed trade-offs. With it, optimisation becomes data-driven.
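A minimal shape for that instrumentation: compute cost from each response's usage object and tag the event with tenant, feature, and prompt identifiers. Field names, rates, and the emit target are all illustrative; wire it into whatever metrics pipeline you already run:

```python
import time
from dataclasses import dataclass

# Illustrative rates; load real ones from config so a price change
# does not require a redeploy.
RATES = {"input": 3.00 / 1e6, "output": 15.00 / 1e6, "cache_read": 0.30 / 1e6}

@dataclass
class CostEvent:
    tenant_id: str        # per-tenant dashboards
    feature: str          # per-feature attribution
    prompt_id: str        # per-prompt tracking
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int
    cost_usd: float
    ts: float

def cost_event(usage, *, tenant_id: str, feature: str, prompt_id: str) -> CostEvent:
    """Build a cost event from a Messages API usage object."""
    cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
    cost = (usage.input_tokens * RATES["input"]
            + usage.output_tokens * RATES["output"]
            + cache_read * RATES["cache_read"])
    return CostEvent(tenant_id, feature, prompt_id, usage.input_tokens,
                     usage.output_tokens, cache_read, cost, time.time())

# emit(cost_event(response.usage, tenant_id="acme",
#                 feature="deflection", prompt_id="kb-v12"))
```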
Cost optimisation patterns
Common optimisations we deploy:
- Model routing — small/cheap model for high-volume routing decisions, larger model for synthesis (sketched after this list)
- Prompt simplification — many production prompts have unnecessary words; shorter prompts cost less
- Aggressive prompt caching — architect retrieval and system prompts for maximum cache hit rate
- Response budgeting — set max_tokens to a tight envelope; long responses cost more
- Batch processing — for offline workloads, Anthropic's batch API offers discounted pricing
- Speculative decoding — for high-throughput workloads where supported
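A sketch of the routing pattern from the first bullet. Model ids and the triage prompt are placeholders; the point is that a cheap one-word classification in front of an expensive synthesiser moves most volume to the low tier:

```python
import anthropic

client = anthropic.Anthropic()

CHEAP_MODEL = "claude-haiku-..."   # placeholder ids; substitute current model names
STRONG_MODEL = "claude-opus-..."

def route(query: str) -> str:
    """Cheap model decides whether the query needs heavyweight reasoning."""
    triage = client.messages.create(
        model=CHEAP_MODEL,
        max_tokens=5,  # response budgeting: the answer is a single word
        system="Reply SIMPLE or COMPLEX: does this query need multi-step reasoning?",
        messages=[{"role": "user", "content": query}],
    )
    return STRONG_MODEL if "COMPLEX" in triage.content[0].text.upper() else CHEAP_MODEL

def answer(query: str) -> str:
    reply = client.messages.create(
        model=route(query),
        max_tokens=400,  # tight response envelope per the budgeting bullet
        messages=[{"role": "user", "content": query}],
    )
    return reply.content[0].text
```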
Most clients see 15–30% cost reduction in the first quarter of optimisation work.