Claude API Pricing Explained

Claude API pricing is per-token, with input and output tokens priced at different rates, and varies by model tier. For typical enterprise workloads the API cost is a small fraction of total engagement cost. Anthropic's prompt caching and provisioned-throughput options change the economics meaningfully for certain workload shapes.

How Claude API pricing works

Claude API costs scale with the number of tokens you send (input) and the number of tokens you receive (output). Pricing differs by:

  • Model tier — Haiku (lowest cost), Sonnet (mid), Opus (highest cost). Higher tiers have stronger capability and higher per-token price.
  • Token type — output tokens are priced several times higher than input tokens. The asymmetry matters because output length often dominates cost on generative workloads.
  • Caching — input tokens served from the cache are billed at a steep discount relative to uncached input; see prompt caching below.
  • Distribution channel — the direct Anthropic API uses list pricing; AWS Bedrock, GCP Vertex AI, and Microsoft Azure consolidate billing into existing cloud spend, sometimes with a hyperscaler markup.
  • Volume commitments — enterprise contracts can include committed-use discounts at volume.

Live pricing is published at anthropic.com/pricing; the specific numbers move over time. NINtec's Discovery phase produces a workload-specific cost projection.
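
The cost model itself is simple enough to sketch in a few lines: per-request cost is linear in token counts, with separate input and output rates. The rates below are illustrative placeholders rather than current prices; substitute the published numbers from anthropic.com/pricing.

```python
# Sketch of the per-request cost model. The rates are illustrative
# placeholders, not current prices; see anthropic.com/pricing.
RATES_PER_MTOK = {                      # USD per million tokens
    "haiku":  {"input": 1.0,  "output": 5.0},
    "sonnet": {"input": 3.0,  "output": 15.0},
    "opus":   {"input": 15.0, "output": 75.0},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of a single request: tokens times the per-token rate."""
    r = RATES_PER_MTOK[model]
    return (input_tokens * r["input"] + output_tokens * r["output"]) / 1_000_000

# A 2,000-token prompt with a 500-token answer on the mid tier:
print(f"${request_cost('sonnet', 2_000, 500):.4f}")   # ~$0.0135
```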

What a token actually costs you

An approximate framing for practical reasoning:

  • Haiku: cents per thousand output tokens — suitable for high-volume routing and classification
  • Sonnet: a few cents per thousand output tokens — workhorse for most enterprise workloads
  • Opus: tens of cents per thousand output tokens — reasoning-heavy or high-stakes workloads

The English-language token-to-word ratio is roughly 1.3 tokens per word. So 1000 output tokens is roughly a 750-word response.

For a customer-service deflection workload averaging 200-token responses across 10K conversations/day on Sonnet, the API cost lands in the low hundreds of dollars per day — small relative to the engineering cost of building and running the system.
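
A back-of-envelope version of that estimate, assuming roughly 3,000 input tokens per conversation (system prompt, history, retrieved context) and the same illustrative rates as above:

```python
# Back-of-envelope for the deflection workload. The 3,000 input tokens
# per conversation is an assumption; the rates are illustrative.
conversations_per_day = 10_000
input_tokens, output_tokens = 3_000, 200      # per conversation
in_rate, out_rate = 3.0, 15.0                 # USD per million tokens

daily_cost = conversations_per_day * (
    input_tokens * in_rate + output_tokens * out_rate
) / 1_000_000
print(f"${daily_cost:.0f}/day")               # ~$120/day: low hundreds
```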

Prompt caching dramatically changes the economics

Anthropic's prompt caching is operationally important. The mechanism: when a prompt has a long stable prefix (a system prompt, retrieval context, document set), the prefix can be cached. Subsequent requests reusing the cached prefix pay a much lower per-token rate for the cached portion.
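
A minimal sketch of the mechanism with the Anthropic Python SDK, marking the stable system prompt as cacheable via cache_control. The model id and the placeholder strings are assumptions; check the current model list before using them.

```python
# Minimal prompt-caching sketch using the Anthropic Python SDK.
# The model id and placeholder strings are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()          # reads ANTHROPIC_API_KEY from env

LONG_STABLE_SYSTEM_PROMPT = "You are a support agent. <guardrails...>"
user_query = "How do I reset my password?"

response = client.messages.create(
    model="claude-sonnet-4-20250514",   # example id; check current models
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_STABLE_SYSTEM_PROMPT,
            # Cache breakpoint: later requests reusing this exact prefix
            # are billed at the cheaper cached-input rate. (Prefixes
            # below a minimum token length are not cached.)
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": user_query}],
)
print(response.content[0].text)
```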

Workload shapes that benefit most:

  • RAG systems with stable retrieval corpora — cache the corpus, pay full price only on the per-query variable portion
  • Long system prompts with consistent role/guardrail instructions — cache the system prompt across all user messages
  • Multi-turn conversations with stable context — cache the early turns
  • Customer-service deflection with a large knowledge base — cache the knowledge base

For cache-friendly workloads, prompt caching typically reduces cost by 30–70%. Architecture decisions that maximise cache hit rate compound into substantial savings at scale.
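
A sketch of how the arithmetic plays out on the input side for a RAG-shaped workload. The cache-write premium and cache-read discount used here are assumptions standing in for the published multipliers:

```python
# Effective input-side cost under prompt caching. The write premium
# (1.25x) and read discount (0.1x) are illustrative assumptions.
def cached_input_cost(prefix_tokens: int, variable_tokens: int,
                      requests: int, hit_rate: float,
                      rate_per_mtok: float = 3.0) -> float:
    misses = requests * (1 - hit_rate)        # misses re-write the prefix
    hits = requests * hit_rate                # hits read it at a discount
    tokens = (misses * prefix_tokens * 1.25   # cache-write premium
              + hits * prefix_tokens * 0.10   # cache-read discount
              + requests * variable_tokens)   # variable part, full price
    return tokens * rate_per_mtok / 1_000_000

uncached = 100_000 * (20_000 + 500) * 3.0 / 1_000_000
cached = cached_input_cost(20_000, 500, 100_000, hit_rate=0.95)
print(f"${uncached:,.0f} -> ${cached:,.0f} per 100K requests on input")
# ~$6,150 -> ~$1,095: the stable prefix stops being the dominant cost
```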

Provisioned throughput for predictable workloads

Provisioned throughput (PTU) is committed-capacity pricing — you pay for a fixed throughput allocation rather than per token. It is the right fit for:

  • Steady-state high-volume workloads where capacity needs are predictable
  • Workloads with strict latency SLAs that benefit from reserved capacity
  • Cost-predictability requirements that favour fixed monthly spend over variable per-token billing
  • Peak-period workloads (retail seasonal, fintech month-end) where burst capacity matters

PTU is wrong for sporadic or low-volume workloads — fixed cost without enough utilisation is wasted spend. NINtec's Discovery includes a PTU recommendation based on your workload's actual shape.
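
A simple break-even sketch makes the decision concrete; both numbers below are hypothetical and would come from your contract and your workload profile.

```python
# Break-even between committed-capacity (PTU) and per-token billing.
# Both inputs are hypothetical; use your contract and traffic numbers.
ptu_monthly_usd = 20_000          # hypothetical committed-capacity price
blended_rate_per_mtok = 6.0       # hypothetical blended input+output rate

breakeven_mtok = ptu_monthly_usd / blended_rate_per_mtok
print(f"Break-even at ~{breakeven_mtok:,.0f}M tokens/month")   # ~3,333M

# Below this volume, per-token billing is cheaper; above it, or when
# latency SLAs demand reserved capacity, the fixed allocation wins.
```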

Cost telemetry as engineering discipline

We instrument every production Claude deployment with cost telemetry from day one:

  • Per-tenant cost dashboards (essential for multi-tenant SaaS)
  • Per-feature cost attribution (which features drive cost)
  • Per-prompt cost tracking (which prompts are expensive)
  • Cost-per-outcome metrics (cost per successful customer deflection, per resolved exception)
  • Anomaly alerts on unexpected cost spikes

Without cost telemetry, the engineering team cannot make informed trade-offs. With it, optimisation becomes data-driven.
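
A minimal sketch of what this instrumentation looks like as a wrapper around the Messages API. The tag names and the emit sink are illustrative; in production the event would feed a metrics pipeline keyed for the dashboards above.

```python
# Per-request cost telemetry wrapper. Tag names and the emit() sink are
# illustrative; token counts come from the API's usage field.
import time
import anthropic

client = anthropic.Anthropic()

def emit(event: dict) -> None:
    print(event)                  # stand-in for a real metrics pipeline

def tracked_call(tenant: str, feature: str, prompt_id: str, **kwargs):
    start = time.monotonic()
    resp = client.messages.create(**kwargs)
    emit({
        "tenant": tenant,                        # per-tenant dashboards
        "feature": feature,                      # per-feature attribution
        "prompt_id": prompt_id,                  # per-prompt tracking
        "input_tokens": resp.usage.input_tokens,
        "output_tokens": resp.usage.output_tokens,
        "latency_s": round(time.monotonic() - start, 3),
    })
    return resp
```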

Cost optimisation patterns

Common optimisations we deploy:

  • Model routing — small/cheap model for high-volume routing decisions, larger model for synthesis
  • Prompt simplification — many production prompts have unnecessary words; shorter prompts cost less
  • Aggressive prompt caching — architect retrieval and system prompts for maximum cache hit rate
  • Response budgeting — set max_tokens to a tight envelope; long responses cost more
  • Batch processing — for offline workloads, Anthropic's batch API offers discounted pricing
  • Speculative decoding — a serving-side throughput optimisation, relevant for high-throughput workloads where the platform supports it

Most clients see 15–30% cost reduction in the first quarter of optimisation work.
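
As an example of the first pattern, here is a minimal model-routing sketch: a cheap classifier call decides whether the stronger model is needed, and max_tokens budgets both paths. Model ids and the routing rule are illustrative assumptions.

```python
# Model-routing sketch: cheap model classifies, strong model answers
# only when needed. Model ids and the routing rule are illustrative.
import anthropic

client = anthropic.Anthropic()
CHEAP, STRONG = "claude-3-5-haiku-latest", "claude-sonnet-4-20250514"

def answer(query: str) -> str:
    # Step 1: cheap classification with a tight response budget.
    label = client.messages.create(
        model=CHEAP, max_tokens=5,
        system="Reply with exactly one word: SIMPLE or COMPLEX.",
        messages=[{"role": "user", "content": query}],
    ).content[0].text.strip().upper()

    # Step 2: route, with response budgeting on the answer as well.
    reply = client.messages.create(
        model=STRONG if label == "COMPLEX" else CHEAP,
        max_tokens=400,
        messages=[{"role": "user", "content": query}],
    )
    return reply.content[0].text
```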

Talk to a Claude architect

48-hour response from a senior architect. The Readiness Assessment scopes the work and proposes named engineers.

Request Readiness Assessment