Why AI Costs Explode in Production
This post is part of a blog series on Maester, an AI API toolkit project. Check out the previous post in the series: Why Debugging AI Systems Is Harder
When people talk about AI cost optimization, the conversation often starts in the wrong place.
The typical advice sounds like this:
- Use smaller models
- Shorten prompts
- Cache responses
- Batch requests
- Route to cheaper providers
All of these help. But none of them guarantees that your system won’t overspend.
In production systems, the real problem is simpler:
A single request can unexpectedly cost far more than you intended.
This happens more often than teams expect.
Before building the request budget feature in Maester, I stepped back and looked at how cost optimization is actually done in modern AI systems.
The Layers of AI Cost Optimization
Most AI systems reduce cost using some combination of five approaches.
1. Request Shaping
The simplest lever is reducing the work done per request.
Examples:
- Trimming prompt size
- Reducing context window
- Lowering max_tokens
- Making fewer model calls
This is usually the first place teams optimize. But it relies on developer discipline rather than system enforcement.
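Turning that discipline into enforcement is straightforward in code. A minimal sketch of a request-shaping helper (the field names and limits here are illustrative, not Maester's actual API):

```python
# Hypothetical request-shaping helper. The limits below are illustrative
# defaults, not values from Maester.
MAX_PROMPT_CHARS = 8_000
MAX_OUTPUT_TOKENS = 512

def shape_request(prompt: str, max_tokens: int) -> dict:
    """Clamp the two biggest per-request cost levers before dispatch."""
    return {
        # Truncate oversized prompts (e.g. when retrieval returns too much context).
        "prompt": prompt[:MAX_PROMPT_CHARS],
        # Cap the maximum possible response size.
        "max_tokens": min(max_tokens, MAX_OUTPUT_TOKENS),
    }
```

The point is that the limits live in one place the system enforces, rather than in each caller's head.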
2. Model Selection
Another common technique is using smaller models whenever possible.
For example:
gpt-4.1
↓
gpt-4.1-mini
↓
gpt-4o-mini
This is why many AI gateways now support model routing or fallback models. The challenge is deciding when it is safe to downgrade.
3. Prompt and Response Caching
Many prompts repeat across requests.
Systems reduce cost by caching:
- Prompt prefixes
- Retrieval results
- Model responses
Caching can significantly reduce token usage in workloads with repeated prompts.
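A minimal response-cache sketch, assuming an in-memory store (a production system would use a shared store with TTLs; `call_model` here is a stand-in for a real gateway call):

```python
import hashlib

# Minimal in-memory response cache keyed by a hash of the normalized prompt.
_cache: dict[str, str] = {}

def cache_key(model: str, prompt: str) -> str:
    # Collapse whitespace so trivially different prompts hit the same entry.
    normalized = " ".join(prompt.split())
    return hashlib.sha256(f"{model}:{normalized}".encode()).hexdigest()

def cached_completion(model: str, prompt: str, call_model) -> str:
    key = cache_key(model, prompt)
    if key not in _cache:
        _cache[key] = call_model(model, prompt)  # tokens are only paid on a miss
    return _cache[key]
```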
4. Async and Batch Workloads
Not every AI request must run synchronously. For non-interactive workloads like:
- Summarization
- Document processing
- Evaluation jobs
systems can batch requests or process them asynchronously. This reduces per-request cost and improves throughput.
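For non-interactive jobs, bounded-concurrency batching is usually enough. A sketch with `asyncio` (the `summarize` coroutine is a placeholder for a real model call):

```python
import asyncio

# Sketch: process a batch of documents asynchronously with a concurrency cap,
# rather than one blocking call per request.
async def summarize(doc: str) -> str:
    await asyncio.sleep(0)  # placeholder for the actual network I/O
    return doc[:20]

async def process_batch(docs: list[str], concurrency: int = 4) -> list[str]:
    sem = asyncio.Semaphore(concurrency)

    async def worker(doc: str) -> str:
        async with sem:  # cap in-flight requests
            return await summarize(doc)

    return await asyncio.gather(*(worker(d) for d in docs))
```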
5. Gateway Cost Controls
This is where things get interesting. Modern AI gateways like LiteLLM, Helicone, and OpenRouter provide:
- Model routing
- Provider failover
- Request tracking
- Cost analytics
These tools improve visibility. But visibility alone doesn’t prevent runaway requests.
The Missing Control: Request Budgets
Imagine this situation.
Your product uses a powerful model:
gpt-4.1
A developer accidentally sends a request with:
max_tokens = 4000
Or a prompt expands because retrieval returned too much context.
Suddenly the request costs 10× more than expected. The system still works, but your cost curve quietly explodes. So we applied a straightforward fix: enforce a cost budget before the request is sent.
How Request Budgets Work
The Budget Guard
Instead of sending every request directly to the model gateway, Maester inserts a pre-flight budget guard.
The flow becomes:
API request
↓
Budget estimate
↓
Policy decision
↓
Model gateway
↓
Cost meter
The budget guard estimates the worst-case cost of a request before execution.
Then it decides whether to:
- Allow the request
- Downgrade to a cheaper model
- Reject the request
The goal was not to build a billing system. The goal was to introduce the first enforceable cost-control primitive into the reliability stack.
The implementation is intentionally split into a few small components.
The Budget Layer
We added a dedicated packages/budgets package instead of putting cost-control logic directly inside the route or the model gateway.
The structure looks like this:
packages/budgets/
models.py
service.py
ledger.py
errors.py
utils.py
This split matters because request budgets are not just a single function.
They involve:
- Policy definition
- Estimation
- Decision logic
- Cost recording
- Reusable helpers
Keeping those concerns separate makes the feature easier to extend later.
Core Budget Models
The budget package starts with three core models.
BudgetPolicy
Defines the guardrails for a request. Typical fields include:
max_cost_usd
max_input_tokens
max_output_tokens
max_total_tokens
fallback_model_if_over_budget
This is the declarative part of the system. It describes what is allowed before a request is executed.
BudgetEstimate
Represents the estimated shape of a request before dispatch. It includes:
- estimated input tokens
- estimated output tokens
- estimated total tokens
- estimated input cost
- estimated output cost
- estimated total cost
This is the object the policy engine evaluates.
BudgetDecision
Represents the result of the policy check. It answers:
Is the request allowed?
Should it fall back to a cheaper model?
Should it be blocked?
What model will actually be used?
This separation is important. We did not want estimation logic and policy decisions mixed together in the route handler.
The Service Layer
The real logic lives in:
packages/budgets/service.py
which contains the RequestBudgetGuard. This service is responsible for three things:
- Estimating the request before execution
- Deciding whether it should proceed
- Recording actual post-request cost in the ledger
Conceptually, the flow looks like this:
incoming API request
↓
RequestBudgetGuard.estimate_request()
↓
RequestBudgetGuard.check_request()
↓
allow / fallback / reject
↓
model gateway execution
↓
RequestBudgetGuard.record_actual_cost()
This is the heart of the feature.
Estimating Request Cost
To estimate the worst-case cost, we calculate:
estimated_input_tokens
estimated_output_tokens
estimated_total_cost
The output tokens are estimated using max_tokens, because that represents the maximum possible response size. Input tokens are approximated from prompt length. The estimate doesn’t need to be perfect. It only needs to be good enough to enforce policy.
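Such a worst-case estimator can be sketched as follows. Both the chars-per-token heuristic and the per-million-token prices below are illustrative assumptions, not Maester's values or current provider rates:

```python
# Illustrative per-1M-token prices (input_usd, output_usd). Check current
# provider pricing before relying on these numbers.
PRICES_PER_1M = {
    "gpt-4.1":      (2.00, 8.00),
    "gpt-4.1-mini": (0.40, 1.60),
}
CHARS_PER_TOKEN = 4  # rough heuristic; a real tokenizer would be more accurate

def estimate_cost(model: str, prompt: str, max_tokens: int) -> float:
    """Worst-case USD cost of a request before it is sent."""
    input_tokens = len(prompt) // CHARS_PER_TOKEN + 1
    output_tokens = max_tokens  # worst case: model uses the full output budget
    in_price, out_price = PRICES_PER_1M[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000
```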
Budget Policy Example
A request policy might look like this:
max_cost_usd = 0.001
fallback_model = gpt-4.1-mini
If the request would exceed the budget: the system first tries the fallback model. If it still exceeds the budget, the request is rejected. This simple rule prevents accidental cost explosions.
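A worked numeric example of that rule, using the same illustrative prices as above ($2/$8 per 1M tokens for gpt-4.1, $0.40/$1.60 for gpt-4.1-mini) and an assumed ~100-token prompt:

```python
# Worked example: a $0.001 budget with a 500-token output cap.
budget = 0.001
prompt_tokens, max_tokens = 100, 500

cost_gpt41 = (prompt_tokens * 2.00 + max_tokens * 8.00) / 1_000_000  # ~$0.0042
cost_mini  = (prompt_tokens * 0.40 + max_tokens * 1.60) / 1_000_000  # ~$0.00084

assert cost_gpt41 > budget   # over budget: try the fallback model
assert cost_mini <= budget   # fallback fits: downgrade instead of reject
```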
Three Scenarios in Practice
The Maester repository includes curl scenarios demonstrating the behavior.
Scenario 1 — Request Allowed
```bash
curl http://localhost:8000/v1/reliable_completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain vector databases.",
    "model": "gpt-4.1-mini",
    "max_tokens": 200,
    "max_cost_usd": "0.01"
  }'
```
The request fits within the budget and executes normally.
Scenario 2 — Automatic Downgrade
```bash
curl http://localhost:8000/v1/reliable_completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain transformers architecture.",
    "model": "gpt-4.1",
    "max_tokens": 500,
    "max_cost_usd": "0.001",
    "allow_fallback_to_cheaper_model": true
  }'
```
The original request would exceed the budget. The gateway automatically downgrades the request to a cheaper model.
Scenario 3 — Request Blocked
```bash
curl http://localhost:8000/v1/reliable_completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Write a long explanation of reinforcement learning.",
    "model": "gpt-4.1",
    "max_tokens": 1000,
    "max_cost_usd": "0.00001"
  }'
```
Even the fallback model exceeds the budget. The request is rejected before any model call is made.
Why This Matters
Most AI cost discussions focus on optimization. But optimization comes after the system is already running. Request budgets do something different. They provide a hard safety boundary.
Instead of asking:
How do we reduce cost?
the system first guarantees: this request cannot exceed this cost. That changes the economics of operating AI systems.
Where This Fits in the Reliability Stack
In Maester, request budgets are one component of a broader reliability architecture:
- Prompt registry
- Request replay
- Evaluation pipelines
- Model gateway
- Request budgets
Together, these pieces help teams run AI systems in production without flying blind.
Suggested Citation:
Lei Ye. Why AI Costs Explode in Production. 2026. https://lei-ye.dev/blog/ai-cost-optimization