AI Compute Cost Engineering: Tokens, GPUs, Caches, And Queue Backpressure

An AI request can succeed technically and still be a product failure because it consumed too many tokens, waited behind an unbounded queue, occupied expensive GPU memory, or triggered a chain of calls whose value never justified their cost.

Every request needs an economic contract

OpenAI's cost guidance groups the obvious levers: make fewer requests, use fewer tokens, and select smaller models when they preserve quality. Its latency guidance adds an important warning: output generation is often the slowest part, and an LLM should not be the default for work a deterministic method can perform. These are architecture decisions, not prompt-writing tricks.

A request budget should state the maximum model calls, input tokens, output tokens, tool calls, wall-clock time, queue time, retries, and money that one task may consume. Without that contract, an agent can be “working correctly” while recursively spending through the product margin.
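One way to make that contract concrete is a small budget object that every step of a task must charge against. This is a minimal sketch: the class name, field names, and default limits are all hypothetical, chosen only to mirror the list in the paragraph above.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a task tries to spend past its contract."""


@dataclass
class RequestBudget:
    # Hypothetical limits; real values depend on the product and its margin.
    max_model_calls: int = 4
    max_input_tokens: int = 20_000
    max_output_tokens: int = 2_000
    max_tool_calls: int = 8
    max_retries: int = 2
    max_wall_seconds: float = 30.0
    max_queue_seconds: float = 5.0
    max_cost_usd: float = 0.05
    spent: dict = field(default_factory=dict)

    def charge(self, resource: str, amount: float) -> None:
        """Record usage and fail fast once a limit would be crossed."""
        limit = getattr(self, f"max_{resource}")
        total = self.spent.get(resource, 0) + amount
        if total > limit:
            raise BudgetExceeded(f"{resource}: {total} > {limit}")
        self.spent[resource] = total

    def remaining(self, resource: str) -> float:
        """Return how much of a resource is still available."""
        return getattr(self, f"max_{resource}") - self.spent.get(resource, 0)
```

An orchestrator would call `budget.charge("model_calls", 1)` before each model call and treat `BudgetExceeded` as a signal to stop or degrade, not as an unexpected error.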

AI request cost pipeline
Admission: Tenant quota, task value, priority, deadline, and budget decide whether work enters.
Context: Retrieval, history, tools, images, and prompt structure determine input work.
Routing: Choose model, provider, region, quality tier, and synchronous or batch path.
Inference: Prefill, cached prefix, decode, batching, GPU memory, and utilization consume capacity.
Queue: Backlog, priority, timeout, cancellation, and backpressure control waiting work.
Outcome: Quality, latency, business value, retries, and user action determine whether cost paid off.

Tokens are measurable work units, not the whole bill

Track uncached input, cached input, output, reasoning, tool calls, retrieval, storage, network, and retries separately. A single total-token number hides whether the system is sending bloated context, generating verbose responses, or repeatedly paying for the same prefix. OpenAI prompt caching exposes cached-token usage and recommends putting stable content first so exact shared prefixes can be reused.
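As a rough illustration of why the split matters, the sketch below prices three token classes separately and reports the cached share of input. The prices are placeholders, not real provider rates, and the `Usage` type is a hypothetical stand-in for whatever usage record a provider returns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    uncached_input: int
    cached_input: int
    output: int


# Placeholder prices in USD per million tokens; NOT real provider rates.
PRICE_PER_MILLION = {"uncached_input": 2.00, "cached_input": 0.50, "output": 8.00}


def cost_breakdown(usage: Usage) -> dict[str, float]:
    """Return the cost of each token class separately, plus the total."""
    parts = {
        name: getattr(usage, name) * price / 1_000_000
        for name, price in PRICE_PER_MILLION.items()
    }
    parts["total"] = sum(parts.values())
    return parts


def cached_input_ratio(usage: Usage) -> float:
    """Share of input tokens that were served from the prompt cache."""
    total_input = usage.uncached_input + usage.cached_input
    return usage.cached_input / total_input if total_input else 0.0
```

With these placeholder prices, a request with 1,000 uncached input tokens, 4,000 cached input tokens, and 500 output tokens spends half of its cost on output even though output is under a tenth of the tokens. A single total-token count would not show that.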

For self-hosted inference, prefix caching serves the same economic purpose. vLLM documents reusing KV-cache blocks when requests share a prefix, avoiding redundant prompt computation. But cache hit rate is a workload property. A cache that consumes GPU memory without serving repeated prefixes can increase cost instead of reducing it.
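The toy model below shows why hit rate depends on the workload. It is not vLLM's implementation: it only assumes that prompts are cached in fixed-size blocks and that a block can be reused only when everything before it also matches. The block size and class name are illustrative.

```python
BLOCK_SIZE = 4  # tokens per cache block; illustrative value only


def block_keys(tokens: list[int]) -> list[tuple[int, ...]]:
    """Key each full block by the whole prefix up to its end."""
    full_blocks = len(tokens) // BLOCK_SIZE
    return [tuple(tokens[: (i + 1) * BLOCK_SIZE]) for i in range(full_blocks)]


class PrefixCache:
    """Toy block-level prefix cache with no eviction."""

    def __init__(self) -> None:
        self.blocks: set[tuple[int, ...]] = set()
        self.hit_tokens = 0
        self.total_tokens = 0

    def process(self, tokens: list[int]) -> int:
        """Return how many prompt tokens reuse cached blocks, then cache the prompt."""
        keys = block_keys(tokens)
        reused = 0
        for key in keys:
            if key not in self.blocks:
                break  # reuse stops at the first block that differs
            reused += BLOCK_SIZE
        self.blocks.update(keys)
        self.hit_tokens += reused
        self.total_tokens += len(tokens)
        return reused

    def hit_rate(self) -> float:
        """Fraction of all processed prompt tokens served from cache."""
        return self.hit_tokens / self.total_tokens if self.total_tokens else 0.0
```

If an eight-token stable prefix comes first, a second request sharing it reuses all eight tokens. If a single varying token is placed ahead of the same prefix, nothing is reused, while the cache still holds memory for the earlier blocks.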

Work budget: Model calls, input/output tokens, tools, retrievals, retries, and maximum chain depth.
Time budget: Queue deadline, first-token target, total timeout, and cancellation propagation.
Capacity budget: GPU memory, concurrency slots, batch delay, cache allocation, and tenant share.
Value budget: Cost ceiling, quality floor, task priority, user tier, and expected business outcome.

GPU utilization is a scheduling problem

A GPU can be expensive while idle and slow while overloaded. Dynamic batching combines compatible inference requests to improve throughput, but it trades a little waiting time for better utilization. NVIDIA Triton exposes batch size, maximum queue delay, queue size, priorities, and timeouts because there is no single correct setting. Interactive chat and overnight enrichment should not compete under the same latency policy.
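The trade-off between waiting and batch fill can be seen in a simplified model of a dynamic batcher. This sketch is not Triton's scheduler or configuration format. The function and parameter names are hypothetical, and it assumes a batch is dispatched either when it is full or when a newly arriving request finds that the oldest request in the batch has waited longer than the allowed delay.

```python
def form_batches(
    arrivals: list[float], max_batch_size: int, max_queue_delay: float
) -> list[list[float]]:
    """Group sorted arrival times (seconds) into dispatch batches."""
    batches: list[list[float]] = []
    current: list[float] = []
    for t in arrivals:
        # The oldest waiting request has exceeded the delay: dispatch first.
        if current and t - current[0] > max_queue_delay:
            batches.append(current)
            current = []
        current.append(t)
        # The batch is full: dispatch immediately.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


def mean_fill(batches: list[list[float]], max_batch_size: int) -> float:
    """Average fraction of batch slots that were actually used."""
    if not batches:
        return 0.0
    return sum(len(b) for b in batches) / (len(batches) * max_batch_size)
```

For arrivals at 0, 2, 4, 50, and 51 milliseconds with a 10 ms delay, a batch size of 4 yields two batches, while a batch size of 2 yields three. An interactive queue would likely choose a short delay and accept lower fill; an overnight job could choose the opposite.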

Autoscaling helps only when the scaling signal represents the real bottleneck and new capacity arrives before the backlog becomes obsolete. Kubernetes HPA can scale from custom and external metrics, but it operates as a control loop with delays and stabilization. Queue depth, oldest-message age, active sequences, cache pressure, and GPU saturation are often more meaningful than CPU alone.
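A small calculation makes the timing problem visible. The first function is the core ratio from the Kubernetes HPA documentation, leaving out tolerance, stabilization windows, and min/max bounds. The drain-time helper and its rate parameters are assumptions added here for illustration.

```python
import math


def desired_replicas(
    current_replicas: int, current_metric: float, target_metric: float
) -> int:
    """Core HPA ratio: ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_metric / target_metric)


def drain_seconds(
    backlog: int,
    arrival_rate: float,
    service_rate_per_replica: float,
    replicas: int,
) -> float:
    """Time to clear a backlog, or infinity if capacity does not exceed arrivals.

    Rates are in tasks per second. This is a steady-state estimate and
    ignores the startup time of new replicas.
    """
    surplus = replicas * service_rate_per_replica - arrival_rate
    return backlog / surplus if surplus > 0 else math.inf
```

With 4 replicas, 30 queued tasks per replica, and a target of 10, the ratio asks for 12 replicas. If each replica serves 5 tasks per second against 20 arrivals per second, the current 4 replicas never drain a 600-task backlog, while 12 replicas clear it in about 15 seconds once they are actually running. Whether that is soon enough depends on the deadlines of the waiting work.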

Backpressure protects both reliability and margin

Queues make bursts survivable, but they also hide overload until waiting work becomes impossible to drain. Amazon's Builders' Library describes preventing insurmountable backlogs, prioritizing workloads, and shedding load. For AI systems, stale work is especially wasteful: a delayed suggestion, duplicate agent run, or evaluation for an outdated model version may consume full inference cost and deliver no value.

Bound queues by count, age, and economic value. Reject early when the task cannot meet its deadline. Cancel downstream calls when the user leaves or the parent task expires. Separate interactive, background, evaluation, and low-priority queues. Backpressure is not a failure of the product; silent, unlimited spending is.
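The first two rules can be sketched as a queue that rejects at admission and drops stale work at dequeue. This is a minimal single-worker FIFO model; the class names and the fixed per-task service estimate are assumptions, and a production system would also need priorities and cancellation propagation.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    task_id: str
    deadline: float  # absolute time after which the result is worthless


class BoundedQueue:
    def __init__(self, max_depth: int, est_service_seconds: float) -> None:
        self.max_depth = max_depth
        self.est_service_seconds = est_service_seconds
        self.items: deque = deque()

    def offer(self, task: Task, now: float) -> bool:
        """Admit only if there is room and the task can plausibly finish in time."""
        if len(self.items) >= self.max_depth:
            return False  # bound by count
        est_finish = now + (len(self.items) + 1) * self.est_service_seconds
        if est_finish > task.deadline:
            return False  # reject early instead of queueing doomed work
        self.items.append(task)
        return True

    def take(self, now: float) -> Optional[Task]:
        """Return the next task that can still meet its deadline, dropping stale ones."""
        while self.items:
            task = self.items.popleft()
            if now + self.est_service_seconds <= task.deadline:
                return task
        return None
```

A rejected `offer` is the backpressure signal: the caller learns immediately that the work will not be done in time and can degrade, retry later, or tell the user, instead of paying for inference on a result nobody is waiting for.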

| Cost pressure | Signal to observe | Engineering control | Failure if ignored |
| --- | --- | --- | --- |
| Oversized context | Input tokens, retrieval hit quality, cached-token ratio. | Filter context, summarize, version prompts, stabilize shared prefixes. | Repeated prefill cost and longer latency. |
| Verbose generation | Output tokens, time to completion, abandoned responses. | Output limits, structured responses, concise prompts, smaller model. | Decode cost dominates while users wait or leave. |
| Too many model calls | Calls per task, chain depth, retries, tool-loop count. | Budgeted orchestration, combine or parallelize steps, deterministic alternatives. | Agent loops multiply cost and failure probability. |
| Low GPU utilization | Active sequences, batch size, GPU memory and compute utilization. | Dynamic batching, routing, right-sized models, scheduled batch workloads. | Paying for idle capacity. |
| Queue overload | Depth, oldest age, deadline misses, cancellation rate. | Bounded queues, priority, admission control, load shedding, backpressure. | Stale work consumes capacity and backlog never drains. |
| Uncontrolled tenant spend | Cost per tenant, task, feature, outcome, and time window. | Quotas, per-task budgets, model tiers, alerts, and graceful degradation. | One workflow destroys margin for everyone. |

What I would build

I would build an inference cost control plane that assigns every task a budget before execution. It would route simple work to deterministic code or smaller models, preserve cache-friendly prefixes, choose interactive versus batch capacity, and stop chains when marginal value falls below remaining cost.
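The stopping rule could look something like the sketch below. It assumes each step carries an estimated cost and an estimated value in the same currency, which is the hard part in practice. The `Step` type and both estimates are hypothetical.

```python
from typing import NamedTuple


class Step(NamedTuple):
    name: str
    est_cost_usd: float
    est_value_usd: float  # hypothetical estimate of what this step adds


def run_chain(steps: list[Step], budget_usd: float) -> tuple[list[str], float]:
    """Execute steps in order until the budget or the marginal value runs out."""
    executed: list[str] = []
    spent = 0.0
    for step in steps:
        if spent + step.est_cost_usd > budget_usd:
            break  # the contract is exhausted
        if step.est_value_usd < step.est_cost_usd:
            break  # the next step is not expected to pay for itself
        executed.append(step.name)  # a real system would run the step here
        spent += step.est_cost_usd
    return executed, spent
```

The sketch only records step names. A real control plane would execute the step, replace the estimate with measured cost, and re-estimate the value of the remaining steps from the result.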

The dashboard would connect money to behavior: cost per successful outcome, input versus output versus cached tokens, GPU utilization, batch fill, queue age, deadline misses, cancellations, retries, and cost avoided by routing, caching, or rejection. The best savings metric is not “spent less”; it is “removed waste without reducing useful outcomes.”
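The headline metric is simple to compute once each task record carries its cost and an outcome flag. The sketch below assumes a hypothetical record shape of `(cost_usd, succeeded)` pairs; how "succeeded" is defined is a product decision that the code cannot settle.

```python
import math


def cost_per_successful_outcome(records: list[tuple[float, bool]]) -> float:
    """Total spend, including failed tasks, divided by the number of successes."""
    total_cost = sum(cost for cost, _ in records)
    successes = sum(1 for _, ok in records if ok)
    return total_cost / successes if successes else math.inf


def wasted_share(records: list[tuple[float, bool]]) -> float:
    """Fraction of spend that went to tasks with no useful outcome."""
    total_cost = sum(cost for cost, _ in records)
    wasted = sum(cost for cost, ok in records if not ok)
    return wasted / total_cost if total_cost else 0.0
```

Dividing by successes instead of by requests is the point: a change that lowers spend but also lowers the success count can leave this number flat or worse, which matches the distinction between spending less and removing waste.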

The design principle

Capacity is not infinite merely because an API accepts another request. Cost engineering makes scarcity explicit at every layer. Budget work before it begins, reuse computation when it is genuinely reusable, batch where latency allows, and apply backpressure before queues turn expensive work into worthless work.