Why AI Costs Explode in Production

This post is part of a blog series on Maester, an AI API toolkit project. See the previous post in the series: Why Debugging AI Systems Is Harder


When people talk about AI cost optimization, the conversation often starts in the wrong place.

The typical advice sounds like this:

  • Use smaller models
  • Shorten prompts
  • Cache responses
  • Batch requests
  • Route to cheaper providers

All of these help. But none of them guarantees that your system won't overspend.

In production systems, the real problem is simpler:

A single request can unexpectedly cost far more than you intended.

This happens more often than teams expect.

Before building the request budget feature in Maester, I stepped back and looked at how cost optimization is actually done in modern AI systems.

The Layers of AI Cost Optimization

Most AI systems reduce cost using some combination of five approaches.

1. Request Shaping

The simplest lever is reducing the work done per request.

Examples:

  • Trimming prompt size
  • Reducing context window
  • Lowering max_tokens
  • Making fewer model calls

This is usually the first place teams optimize. But it relies on developer discipline rather than system enforcement.
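Some of that discipline can be turned into enforcement. Here is a minimal request-shaping sketch; the limits and the helper name are illustrative, not part of Maester:

```python
# Illustrative caps, enforced in code rather than by convention.
MAX_PROMPT_CHARS = 8_000   # assumed limit for this sketch
MAX_OUTPUT_TOKENS = 512    # assumed cap for this sketch

def shape_request(prompt: str, max_tokens: int) -> tuple[str, int]:
    """Return a (prompt, max_tokens) pair that respects hard caps."""
    shaped_prompt = prompt[:MAX_PROMPT_CHARS]        # truncate oversized prompts
    shaped_max_tokens = min(max_tokens, MAX_OUTPUT_TOKENS)  # clamp the output cap
    return shaped_prompt, shaped_max_tokens
```

Even a crude clamp like this catches the most common mistake: an accidentally huge `max_tokens` slipping into production.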

2. Model Selection

Another common technique is using smaller models whenever possible.

For example:

  • gpt-4.1
  • gpt-4.1-mini
  • gpt-4o-mini

This is why many AI gateways now support model routing or fallback models. The challenge is deciding when it is safe to downgrade.

3. Prompt and Response Caching

Many prompts repeat across requests.

Systems reduce cost by caching:

  • Prompt prefixes
  • Retrieval results
  • Model responses

Caching can significantly reduce token usage in workloads with repeated prompts.
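As a sketch, a response cache can be as simple as a hash-keyed map in front of the model call. The key scheme and the `call_model` stub below are illustrative assumptions, not Maester's implementation:

```python
import hashlib

# In-memory response cache keyed by (model, prompt).
_cache: dict[str, str] = {}

def cache_key(model: str, prompt: str) -> str:
    # Hash model and prompt together so identical prompts to
    # different models do not collide.
    return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

def cached_completion(model: str, prompt: str, call_model) -> str:
    key = cache_key(model, prompt)
    if key not in _cache:
        _cache[key] = call_model(model, prompt)  # only pay tokens on a miss
    return _cache[key]
```

Real systems add eviction and TTLs, but the economics are the same: a cache hit costs zero tokens.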

4. Async and Batch Workloads

Not every AI request must run synchronously. For non-interactive workloads like:

  • Summarization
  • Document processing
  • Evaluation jobs

systems can batch requests or process them asynchronously. This reduces per-request cost and improves throughput.
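A minimal sketch of the concurrent pattern using asyncio, with a stubbed `summarize()` standing in for a real model call:

```python
import asyncio

async def summarize(doc: str) -> str:
    # Stand-in for a real async model call.
    await asyncio.sleep(0)   # placeholder for network I/O
    return doc[:20]          # placeholder "summary"

async def summarize_batch(docs: list[str]) -> list[str]:
    # Fan out all jobs concurrently instead of one blocking call each.
    return list(await asyncio.gather(*(summarize(d) for d in docs)))

results = asyncio.run(summarize_batch(["doc one", "doc two"]))
```

Many providers also offer discounted batch APIs for exactly these workloads, which compounds the savings.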

5. Gateway Cost Controls

This is where things get interesting. Modern AI gateways like LiteLLM, Helicone, and OpenRouter provide:

  • Model routing
  • Provider failover
  • Request tracking
  • Cost analytics

These tools improve visibility. But visibility alone doesn’t prevent runaway requests.

The Missing Control: Request Budgets

Imagine this situation.

Your product uses a powerful model:

gpt-4.1

A developer accidentally sends a request with:

max_tokens = 4000

Or a prompt expands because retrieval returned too much context.

Suddenly the request costs 10× more than expected. The system still works, but your cost curve quietly explodes. Maester’s answer is deliberately straightforward: enforce a cost budget before the request is sent.
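That 10× is plain arithmetic. A sketch with hypothetical per-token prices (real provider prices differ):

```python
# Illustrative output price, NOT a real provider price.
PRICE_PER_1K_OUTPUT_USD = 0.008

def worst_case_output_cost(max_tokens: int) -> float:
    # Worst case: the model fills the entire output cap.
    return max_tokens / 1000 * PRICE_PER_1K_OUTPUT_USD

expected = worst_case_output_cost(400)    # the cap you intended
actual = worst_case_output_cost(4000)     # the cap that shipped
# Same prompt, same model, 10x the worst-case cost.
```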

How Request Budgets Work

The Budget Guard

Instead of sending every request directly to the model gateway, Maester inserts a pre-flight budget guard.

The flow becomes:

API request → budget estimate → policy decision → model gateway → cost meter

The budget guard estimates the worst-case cost of a request before execution.

Then it decides whether to:

  1. Allow the request
  2. Downgrade to a cheaper model
  3. Reject the request

The goal was not to build a billing system. The goal was to introduce the first enforceable cost-control primitive into the reliability stack.

The implementation is intentionally split into a few small components.

The Budget Layer

We added a dedicated packages/budgets package instead of putting cost-control logic directly inside the route or the model gateway.

The structure looks like this:

packages/budgets/
  models.py
  service.py
  ledger.py
  errors.py
  utils.py

This split matters because request budgets are not just a single function.

They involve:

  • Policy definition
  • Estimation
  • Decision logic
  • Cost recording
  • Reusable helpers

Keeping those concerns separate makes the feature easier to extend later.

Core Budget Models

The budget package starts with three core models.

BudgetPolicy

Defines the guardrails for a request. Typical fields include:

  • max_cost_usd
  • max_input_tokens
  • max_output_tokens
  • max_total_tokens
  • fallback_model_if_over_budget

This is the declarative part of the system. It describes what is allowed before a request is executed.

BudgetEstimate

Represents the estimated shape of a request before dispatch. It includes:

  • estimated input tokens
  • estimated output tokens
  • estimated total tokens
  • estimated input cost
  • estimated output cost
  • estimated total cost

This is the object the policy engine evaluates.

BudgetDecision

Represents the result of the policy check. It answers:

Is the request allowed?
Should it fall back to a cheaper model?
Should it be blocked?
What model will actually be used?

This separation is important. We did not want estimation logic and policy decisions mixed together in the route handler.

The Service Layer

The real logic lives in:

packages/budgets/service.py

which contains the RequestBudgetGuard. This service is responsible for three things:

  1. Estimating the request before execution
  2. Deciding whether it should proceed
  3. Recording actual post-request cost in the ledger

Conceptually, the flow looks like this:

incoming API request → RequestBudgetGuard.estimate_request() → RequestBudgetGuard.check_request() → allow / fallback / reject → model gateway execution → RequestBudgetGuard.record_actual_cost()

This is the heart of the feature.
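A skeleton of that flow, with simplified signatures. The method names follow the post; the bodies and the single per-model price are assumptions, not the actual service.py:

```python
class RequestBudgetGuard:
    def __init__(self, price_per_1k_usd: dict[str, float]):
        # Flat USD-per-1K-token price per model (real pricing splits
        # input and output rates; simplified here).
        self.price_per_1k_usd = price_per_1k_usd
        self.ledger: list[dict] = []  # stand-in for ledger.py

    def estimate_request(self, model: str, prompt: str, max_tokens: int) -> float:
        input_tokens = len(prompt) // 4          # rough chars-to-tokens heuristic
        worst_case_tokens = input_tokens + max_tokens
        return worst_case_tokens / 1000 * self.price_per_1k_usd[model]

    def check_request(self, model: str, prompt: str, max_tokens: int,
                      max_cost_usd: float) -> bool:
        # Allow only if the worst-case estimate fits the budget.
        return self.estimate_request(model, prompt, max_tokens) <= max_cost_usd

    def record_actual_cost(self, model: str, cost_usd: float) -> None:
        # After execution, record what the request actually cost.
        self.ledger.append({"model": model, "cost_usd": cost_usd})
```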

Estimating Request Cost

To estimate the worst-case cost, we calculate:

estimated_input_tokens
estimated_output_tokens
estimated_total_cost

Output tokens are estimated from max_tokens, since that represents the maximum possible response size. Input tokens are approximated from prompt length. The estimate doesn’t need to be perfect; it only needs to be good enough to enforce policy.

Budget Policy Example

A request policy might look like this:

max_cost_usd = 0.001
fallback_model = gpt-4.1-mini

If the request would exceed the budget: the system first tries the fallback model. If it still exceeds the budget, the request is rejected. This simple rule prevents accidental cost explosions.

Three Scenarios in Practice

The Maester repository includes curl scenarios demonstrating the behavior.

Scenario 1 — Request Allowed

curl http://localhost:8000/v1/reliable_completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain vector databases.",
    "model": "gpt-4.1-mini",
    "max_tokens": 200,
    "max_cost_usd": "0.01"
  }'

The request fits within the budget and executes normally.

Scenario 2 — Automatic Downgrade

curl http://localhost:8000/v1/reliable_completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain transformers architecture.",
    "model": "gpt-4.1",
    "max_tokens": 500,
    "max_cost_usd": "0.001",
    "allow_fallback_to_cheaper_model": true
  }'

The original request would exceed the budget. The gateway automatically downgrades the request to a cheaper model.

Scenario 3 — Request Blocked

curl http://localhost:8000/v1/reliable_completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Write a long explanation of reinforcement learning.",
    "model": "gpt-4.1",
    "max_tokens": 1000,
    "max_cost_usd": "0.00001"
  }'

Even the fallback model exceeds the budget. The request is rejected before any model call is made.

Why This Matters

Most AI cost discussions focus on optimization. But optimization comes after the system is already running. Request budgets do something different. They provide a hard safety boundary.

Instead of asking:

How do we reduce cost?

the system first guarantees: this request cannot exceed this cost. That changes the economics of operating AI systems.

Where This Fits in the Reliability Stack

In Maester, request budgets are one component of a broader reliability architecture:

  • Prompt registry
  • Request replay
  • Evaluation pipelines
  • Model gateway
  • Request budgets

Together, these pieces help teams run AI systems in production without flying blind.

Suggested Citation:


Lei Ye. Why AI Costs Explode in Production. 2026. https://lei-ye.dev/blog/ai-cost-optimization


#AI Infrastructure · #SaaS Architecture · #Production ML · #System Design