Mastering Amazon Bedrock cost optimization

Bedrock pricing looks simple on the marketing page. The bill almost never is. Here is the playbook we use to keep it sane.

By ForthClover Engineering · January 2026 · 9 min read · Cost Optimization

Bedrock's pricing page looks straightforward. A per-thousand-token rate for input, a per-thousand-token rate for output, a small line item for embeddings. You do the multiplication, you slot a number into the budget, you ship. Then the bill arrives at the end of the month and it is two or three times what the spreadsheet said.

We have had more than a few of these conversations with clients. The pricing card is honest; it just does not capture how the bill actually accumulates in production. This post is the playbook we use to keep Bedrock spend predictable, written from the perspective of teams that already have something in production rather than teams pricing out a slide.

Where the bill actually comes from

Almost every Bedrock cost surprise we've diagnosed traces to one of four causes:

  • Output tokens nobody budgeted for. Output is usually three to five times more expensive per token than input, and a chatty system prompt or a model in a runaway state can generate thousands of output tokens per call. Most cost models we've seen estimate output as 1× input. It is not.
  • Retries that look free. A 5xx on a long generation still costs you the partial output that was streamed before the failure, and the retry pays for the whole thing again. With aggressive retry policies it is easy to pay 2–3× for a single user-visible response.
  • Embedding work that scales with traffic, not corpus. Teams budget the one-time cost of embedding their corpus and forget that every user query also needs to be embedded. On a high-traffic RAG system, query embeddings can quietly become the largest line on the bill.
  • Provisioned throughput bought for the peak. Provisioned throughput is billed by the hour whether you use it or not. A team sizes it for Black Friday, leaves it on for the rest of the year, and the bill is dominated by idle capacity.

Right-size the model per request, not per app

The single highest-leverage decision is which model handles which request. There is no good reason to send a routine classification call to Claude Sonnet when a Haiku-class model gets the right answer for a fraction of the price. The mistake we see most often is teams picking one strong model for everything and then trying to optimize tokens around it.

A useful pattern: build a small router on the call path that inspects the request shape (length, task type, or a cheap classifier) and dispatches to the smallest model that can handle that class of work. Reserve the expensive models for the calls that actually need them — long-context reasoning, complex tool-use plans, agent steps where you cannot afford a wrong turn. Everything else goes to a cheaper tier.
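To make the pattern concrete, here is a minimal sketch of such a router using boto3 and the Converse API. The model IDs, task types, and the length threshold are placeholders you would tune against your own traffic and eval set, not a prescription.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

# Placeholder model IDs: substitute whichever tiers your account has access to.
SMALL = "anthropic.claude-3-5-haiku-20241022-v1:0"
MID = "anthropic.claude-3-5-sonnet-20241022-v2:0"
LARGE = "anthropic.claude-3-opus-20240229-v1:0"


def pick_model(task_type: str, prompt: str) -> str:
    """Dispatch to the smallest model that can handle this class of work."""
    if task_type in ("classification", "extraction", "short_summary"):
        return SMALL
    if task_type == "agent_planning" or len(prompt) > 50_000:
        return LARGE
    return MID  # RAG answer composition, structured generation, the default tier


def invoke(task_type: str, prompt: str) -> str:
    response = bedrock.converse(
        modelId=pick_model(task_type, prompt),
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]
```

A cheap classifier call (or even a keyword heuristic) can replace the `task_type` argument when the caller does not know the task type up front; the important part is that the decision happens per request, on the call path.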

The shape of the savings depends on the workload, but on the mixed agent + RAG systems we've tuned, routing has typically taken 40–60% off the model bill without measurably moving quality scores on the eval set.

The cost / quality / latency table we ask clients to fill in

Before we recommend a routing strategy, we get the team to commit numbers to a table that looks like this:

Model tier | Use case | Eval score | p95 latency | Cost / 1k req
Small (e.g. Haiku-class) | Classification, extraction, short summaries | To measure | To measure | To measure
Mid (e.g. Sonnet-class) | RAG answer composition, structured generation | To measure | To measure | To measure
Large (e.g. Opus-class) | Agent planning, long-context reasoning | To measure | To measure | To measure

The point is not the table itself; it is forcing the team to measure each tier on their own evaluation set rather than reasoning from the marketing page. Half the time the team discovers the smaller model is good enough for a much larger slice of traffic than they assumed.
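One way to populate the table is a small harness that runs the same eval set through each tier and records the columns. A rough sketch, assuming the Converse API, an `eval_set` of dicts with a `prompt` field, and a `score_answer` function the team already has; none of these names are prescribed by Bedrock.

```python
import statistics
import time

import boto3

bedrock = boto3.client("bedrock-runtime")


def measure_tier(model_id: str, eval_set: list[dict], score_answer) -> dict:
    """Run one model tier over the eval set and collect the columns of the table above."""
    latencies, scores, input_tokens, output_tokens = [], [], [], []
    for case in eval_set:
        start = time.perf_counter()
        resp = bedrock.converse(
            modelId=model_id,
            messages=[{"role": "user", "content": [{"text": case["prompt"]}]}],
        )
        answer = resp["output"]["message"]["content"][0]["text"]
        latencies.append(time.perf_counter() - start)
        scores.append(score_answer(case, answer))
        input_tokens.append(resp["usage"]["inputTokens"])
        output_tokens.append(resp["usage"]["outputTokens"])
    return {
        "eval_score": statistics.mean(scores),
        "p95_latency_s": sorted(latencies)[int(0.95 * (len(latencies) - 1))],
        # Multiply the token averages by your per-token rates for cost per 1k requests.
        "avg_input_tokens": statistics.mean(input_tokens),
        "avg_output_tokens": statistics.mean(output_tokens),
    }
```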

Cache the calls you keep making over and over

A surprising amount of production AI traffic is repetitive. Identical prompts. Near-identical prompts that differ only in formatting. The same RAG context retrieved twice within a minute. Caching at two levels usually cuts another 20–40% off the bill on top of routing.

  • Prompt-level caching (now natively supported on Bedrock for some models) lets you reuse the long system prompt and shared context tokens across calls without re-billing for them. For agents and RAG this is significant because the system prompt + retrieval context often dominate the input tokens.
  • Response-level caching for read-heavy workloads where the same question is asked over and over. We usually key off a hash of (system prompt + user message + relevant retrieval IDs) and store responses in a small Redis or DynamoDB table with a short TTL. For knowledge-base queries the hit rate is often 30–50%.
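To make both levels concrete, here are two minimal sketches under stated assumptions: boto3's Converse API, Redis for the response cache, and placeholder model IDs and TTLs. The cachePoint block is how Bedrock's Converse API marks a reusable prompt prefix on models that support prompt caching at the time of writing.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")
MODEL_ID = "anthropic.claude-3-5-sonnet-20241022-v2:0"  # placeholder


def ask(system_prompt: str, user_message: str) -> str:
    """Prompt-level caching: the cachePoint marks everything before it (the long
    system prompt) as a prefix Bedrock may reuse across calls on supported models."""
    response = bedrock.converse(
        modelId=MODEL_ID,
        system=[
            {"text": system_prompt},
            {"cachePoint": {"type": "default"}},
        ],
        messages=[{"role": "user", "content": [{"text": user_message}]}],
    )
    return response["output"]["message"]["content"][0]["text"]
```

And a response-level cache layered on top, keyed the way described above:

```python
import hashlib
import json

import redis

r = redis.Redis()


def cache_key(system_prompt: str, user_message: str, retrieval_ids: list[str]) -> str:
    """Hash of everything that determines the answer, so equivalent requests collide."""
    payload = json.dumps([system_prompt, user_message, sorted(retrieval_ids)])
    return "bedrock:resp:" + hashlib.sha256(payload.encode()).hexdigest()


def cached_ask(system_prompt, user_message, retrieval_ids, ttl_seconds=900):
    key = cache_key(system_prompt, user_message, retrieval_ids)
    hit = r.get(key)
    if hit is not None:
        return hit.decode()
    answer = ask(system_prompt, user_message)  # falls through to the model call above
    r.set(key, answer, ex=ttl_seconds)  # TTL should track data freshness, not the cost target
    return answer
```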

The standard caveat applies: cache invalidation. If your underlying knowledge base changes daily and you cache for longer than that, you will serve stale answers and lose user trust. Pick a TTL that matches your data freshness, not your cost target.

Batch what you can, stream what you must

Bedrock's batch inference endpoint is dramatically cheaper than on-demand for workloads that can tolerate latency — about 50% off list price at the time of writing. If a class of work does not need to be real-time (overnight enrichment, batch classification of incoming records, periodic re-summarization of the corpus), it should be on the batch endpoint.
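For reference, submitting a batch job looks roughly like this with boto3. The job name, model ID, role ARN, and S3 paths below are placeholders; records go into a JSONL file in S3 and results land back in S3 when the job completes.

```python
import boto3

bedrock = boto3.client("bedrock")  # control-plane client, not bedrock-runtime

# Each line of the input JSONL carries a recordId and a modelInput in the same
# shape the model expects for on-demand invocation.
job = bedrock.create_model_invocation_job(
    jobName="nightly-enrichment-2026-01-15",  # placeholder
    modelId="anthropic.claude-3-5-haiku-20241022-v1:0",  # placeholder
    roleArn="arn:aws:iam::123456789012:role/bedrock-batch-role",  # placeholder
    inputDataConfig={"s3InputDataConfig": {"s3Uri": "s3://my-bucket/batch/input/records.jsonl"}},
    outputDataConfig={"s3OutputDataConfig": {"s3Uri": "s3://my-bucket/batch/output/"}},
)
print(job["jobArn"])
```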

For real-time work, streaming responses do not change the per-token cost, but they let you start cancelling early when the model is heading the wrong direction. Combined with a repetition detector or a guardrail that watches for known failure patterns, this is one of the few ways to actually pay less for a single call rather than restructuring traffic.
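A rough sketch of that early-cancel pattern with converse_stream. The runaway checks here (an output-length cap and a repeated tail substring) are deliberately crude stand-ins for whatever failure detector or guardrail you actually run.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")


def stream_with_early_cancel(model_id: str, prompt: str, max_chars: int = 20_000) -> str:
    """Stream a response and stop consuming as soon as the output looks like a runaway."""
    response = bedrock.converse_stream(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    stream = response["stream"]
    chunks: list[str] = []
    for event in stream:
        delta = event.get("contentBlockDelta", {}).get("delta", {}).get("text")
        if not delta:
            continue
        chunks.append(delta)
        text = "".join(chunks)
        # Crude runaway checks: output far longer than this call should ever produce,
        # or the last 200 characters already appeared verbatim earlier in the output.
        if len(text) > max_chars or (len(text) > 400 and text[-200:] in text[:-200]):
            stream.close()  # stop consuming; you pay only for what was generated so far
            break
    return "".join(chunks)
```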

Provisioned throughput is for floors, not peaks

Provisioned throughput is genuinely cheaper per token when you can keep it busy. The mistake is sizing it for the peak. The math only works if the provisioned units are saturated for most of the day. The pattern we use:

  1. Measure your traffic floor — the level of usage you sustain for at least 16 hours a day.
  2. Provision throughput at that floor. Everything above it spills onto on-demand.
  3. Re-measure quarterly. Floors move as products grow; provisioned commitments need to move with them or you are paying for capacity you no longer need.

Provisioning at the peak almost always costs more than just paying on-demand for the peak.
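A minimal sketch of step 1, assuming you can pull hourly token throughput for the model from the AWS/Bedrock CloudWatch metrics or from your own call logs: the floor is the highest throughput level still sustained at least 16 hours a day, which works out to roughly the 33rd percentile of the hourly distribution.

```python
def provisioning_floor(hourly_tokens: list[int], sustained_hours_per_day: int = 16) -> int:
    """Throughput level sustained for at least `sustained_hours_per_day` hours each day.

    `hourly_tokens` is total tokens processed per hour over a representative window
    (say, the last 30 days). Requiring 16 of 24 hours above the floor puts the floor
    at roughly the 33rd percentile of the hourly distribution.
    """
    quantile = 1.0 - sustained_hours_per_day / 24  # 16 h/day above the floor -> ~0.33
    ordered = sorted(hourly_tokens)
    return ordered[int(quantile * (len(ordered) - 1))]


# Toy example: 16 quiet hours and 8 busy hours a day puts the floor at the quiet level.
print(provisioning_floor([120_000] * 16 + [400_000] * 8))  # -> 120000
```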

Make the bill observable from day one

The cheapest cost optimization is the one you make before the bill arrives. Tagging every Bedrock call with the application, the user, and the use case at the time of the call lets you attribute spend to specific features the moment something spikes. Without this, every monthly review starts with two hours of forensic SQL trying to reverse-engineer what cost what.

We use a thin wrapper around the Bedrock SDK that:

  • Logs every call (model, input tokens, output tokens, latency, tags) to CloudWatch or a structured-events table.
  • Exposes a per-feature daily-cost dashboard so the engineering team can see drift before finance does.
  • Triggers an alert when a single feature's daily spend doubles versus the trailing 7-day average. This catches runaway agents and bad deploys within hours instead of weeks.

The wrapper is fifty lines of code. It pays for itself the first time it catches a leaking feature.
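A stripped-down version of that wrapper, assuming the Converse API and one structured JSON log line per call (scraped by CloudWatch Logs or written to an events table). The tag names are our convention, not Bedrock's, and the rate table is an empty placeholder to fill from your current price list.

```python
import json
import time

import boto3

bedrock = boto3.client("bedrock-runtime")

# Per-1k-token rates by model ID as (input_rate, output_rate); fill in from your price list.
RATES: dict[str, tuple[float, float]] = {}


def tagged_converse(model_id: str, messages: list, *, feature: str, user_id: str, use_case: str, **kwargs):
    """Converse call that emits one structured cost record per invocation."""
    start = time.perf_counter()
    response = bedrock.converse(modelId=model_id, messages=messages, **kwargs)
    usage = response["usage"]
    rate_in, rate_out = RATES.get(model_id, (0.0, 0.0))
    record = {
        "model_id": model_id,
        "feature": feature,
        "user_id": user_id,
        "use_case": use_case,
        "input_tokens": usage["inputTokens"],
        "output_tokens": usage["outputTokens"],
        "latency_ms": round((time.perf_counter() - start) * 1000),
        "estimated_cost_usd": usage["inputTokens"] / 1000 * rate_in
        + usage["outputTokens"] / 1000 * rate_out,
    }
    # One JSON line per call; CloudWatch Logs Insights (or any events table) can roll
    # these up into the per-feature daily dashboard and the spend-doubling alert.
    print(json.dumps(record))
    return response
```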

What we do not bother optimizing

Three things that are technically optimizations but rarely worth the engineering time on a serious production workload:

  • Aggressive prompt compression. Squeezing 15% out of a system prompt rarely moves the needle compared to routing or caching, and it makes the prompt harder to iterate on. Reserve compression for the cases where the prompt is the dominant cost.
  • Switching providers to chase a 5% rate cut. The migration cost is almost always larger than the saving, and the operational risk of a new vendor in the critical path is real.
  • Hand-tuning temperature for cost. It does not affect cost at all. We bring this up because we have seen three different teams convince themselves it does.

The order we work in

When we're brought in to fix a Bedrock bill, the order of operations is almost always the same:

  1. Add observability (so you can see what is costing what).
  2. Right-size models per request (the biggest single win).
  3. Cache (prompt-level first, response-level second).
  4. Move latency-tolerant work to batch and re-evaluate provisioned throughput.
  5. Re-measure. Decide whether further optimization is worth it.

On most production workloads we've worked on, the first three steps land somewhere in the 50–70% bill-reduction range, and steps four and five matter less and less. The temptation is always to start with the clever optimization. Start with the measurement.