
Cloud Cost Observability For AI Workloads

A provider invoice can tell you what the organization spent. It usually cannot tell you which tenant, feature, agent, evaluation, scheduled job, or useful outcome caused the spend. AI cost observability builds that missing chain.

AI spend crosses too many meters for invoice-only FinOps

The FinOps Foundation describes FinOps for AI as a response to complex cost models, fast development cycles, unpredictable spend, and the need to align consumption with business value. A single AI feature may incur charges for model tokens, embedding jobs, vector database queries, object storage, GPU capacity, logs, egress, tool calls, and scheduled evaluations across several providers.

Cloud tags are necessary, but they stop at the resource boundary. Shared inference endpoints and vector clusters serve many tenants and features. The application must add the dimensions cloud billing cannot see: request, task, agent, prompt version, feature, tenant, environment, owner, and outcome.
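
As a minimal sketch of what "adding the dimensions" could look like in application code, the block below defines a small record for the dimensions listed above and flattens it into key/value attributes that could be attached to a trace span or a log line. The class name `CostDimensions`, the helper `span_attributes`, and the `app.cost.` prefix are all hypothetical and do not come from any standard or library.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class CostDimensions:
    """Application-level dimensions that cloud billing cannot see (hypothetical schema)."""
    request_id: str
    task: str
    agent: str
    prompt_version: str
    feature: str
    tenant: str
    environment: str
    owner: str
    outcome: str = "unknown"  # filled in later, once the result is known


def span_attributes(dims: CostDimensions, prefix: str = "app.cost.") -> dict[str, str]:
    """Flatten the dimensions into prefixed key/value attributes."""
    return {f"{prefix}{key}": value for key, value in asdict(dims).items()}


# Illustrative values only.
dims = CostDimensions(
    request_id="req-123",
    task="summarize_ticket",
    agent="support_agent",
    prompt_version="v7",
    feature="ticket_summary",
    tenant="tenant-a",
    environment="prod",
    owner="support-platform",
)
attrs = span_attributes(dims)
```

Keeping the dimensions in one immutable record makes it harder for individual call sites to drop a field that the cost join depends on later.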

AI cost allocation dashboard
Business view: Cost per useful outcome. Successful resolution, accepted suggestion, completed workflow, or revenue event.
Product view: Cost per feature and tenant. Which product surfaces and customer tiers consume shared AI capacity.
Workflow view: Cost per task and agent. Model calls, tools, retries, scheduled jobs, and chain depth by operation.
Model view: Inference and embedding mix. Models, providers, cached input, output, batch, evaluations, and quality.
Platform view: Shared infrastructure. GPU pools, vector databases, storage, observability, queues, and egress.
Governance view: Ownership and allocation. Team, project, cost center, environment, budget, anomalies, and forecast.

Build a cost lineage, not another billing dashboard

The FOCUS specification provides a common language for cost and usage data. AWS cost allocation tags and Azure cost allocation rules add business context and distribute shared costs. OpenAI's Usage API can group costs by project and line item. OpenTelemetry's generative AI semantic conventions standardize telemetry for model operations. These are useful pieces, but the system still needs a join key that connects provider usage back to the application trace and business outcome.

Carry correlation identifiers from user request through agent task, model call, embedding job, vector query, storage operation, and downstream tool. Then join usage records and shared-resource allocation to those identifiers. The result is a cost trace that explains both direct charges and allocated platform overhead.

Ownership: Team, cost center, product, environment, owner, and budget policy.
Workload: Feature, tenant, task, agent, model, prompt version, and scheduled job.
Resource: Tokens, GPU, embeddings, vector DB, storage, egress, logs, and tools.
Outcome: Quality, latency, completion, acceptance, revenue, failure, and abandonment.
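
The join described above can be sketched in a few lines, assuming the application has recorded context per correlation ID and that provider usage records have already been priced. The record shapes, field names, and the function `build_cost_traces` are illustrative assumptions, not a provider or FOCUS schema.

```python
from collections import defaultdict

# Hypothetical trace context captured by the application, keyed by correlation ID.
trace_context = {
    "req-1": {"tenant": "tenant-a", "feature": "ticket_summary", "outcome": "accepted"},
    "req-2": {"tenant": "tenant-b", "feature": "ticket_summary", "outcome": "abandoned"},
}

# Hypothetical usage records from providers, already priced in USD.
usage_records = [
    {"correlation_id": "req-1", "resource": "model_tokens", "cost_usd": 0.012},
    {"correlation_id": "req-1", "resource": "vector_query", "cost_usd": 0.001},
    {"correlation_id": "req-2", "resource": "model_tokens", "cost_usd": 0.020},
    {"correlation_id": None, "resource": "egress", "cost_usd": 0.005},  # no join key
]


def build_cost_traces(usage, context):
    """Sum cost per correlation ID, attach application context, and total unjoined cost."""
    totals = defaultdict(float)
    unallocated = 0.0
    for record in usage:
        cid = record["correlation_id"]
        if cid in context:
            totals[cid] += record["cost_usd"]
        else:
            # Usage without a known correlation ID stays visible as unallocated cost.
            unallocated += record["cost_usd"]
    traces = {
        cid: {**context[cid], "cost_usd": round(total, 6)}
        for cid, total in totals.items()
    }
    return traces, round(unallocated, 6)


traces, unallocated = build_cost_traces(usage_records, trace_context)
```

In this sketch, usage that cannot be joined is returned as a separate total instead of being folded into a request, so gaps in identifier propagation show up as a number.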

Allocate shared AI infrastructure with usage drivers

Equal allocation is simple and usually wrong. Shared GPU pools may be allocated by GPU-seconds or active sequence time. Vector databases may use stored vectors, query units, and data transfer. Observability may use spans or bytes ingested. Scheduled evaluation platforms may use model calls and dataset size. The allocation driver should resemble the behavior that creates cost and remain understandable to the team being charged.
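
A usage-driver allocation is a proportional split. The sketch below divides a shared cost by each tenant's share of a measured driver, here GPU-seconds. The function name `allocate_by_driver` and all figures are made up for illustration.

```python
def allocate_by_driver(shared_cost: float, usage_by_tenant: dict[str, float]) -> dict[str, float]:
    """Split a shared cost in proportion to each tenant's measured usage driver."""
    total_usage = sum(usage_by_tenant.values())
    if total_usage <= 0:
        # No usage signal: allocate nothing rather than inventing an equal split.
        return {}
    return {
        tenant: shared_cost * usage / total_usage
        for tenant, usage in usage_by_tenant.items()
    }


# Illustrative numbers: a 1000 USD GPU pool and GPU-seconds per tenant.
gpu_seconds = {"tenant-a": 6000.0, "tenant-b": 3000.0, "tenant-c": 1000.0}
allocation = allocate_by_driver(1000.0, gpu_seconds)
# An equal split would charge each tenant about 333 USD; the driver-based
# split charges 600, 300, and 100 USD, matching measured consumption.
```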

Keep allocation confidence visible. Directly metered model calls are high confidence. A shared cluster divided by estimated usage is lower confidence. Unallocated cost should be a tracked metric, not silently spread across teams.
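
One way to keep confidence visible is to label every ledger entry and report the share of spend at each level, with unallocated cost as its own bucket. This is a sketch under assumed field names. The `confidence` labels and the `confidence_summary` helper are hypothetical.

```python
# Hypothetical ledger entries; confidence is None when cost has no allocation.
ledger = [
    {"item": "model_calls", "cost_usd": 700.0, "confidence": "high"},
    {"item": "shared_gpu_pool", "cost_usd": 200.0, "confidence": "low"},
    {"item": "egress", "cost_usd": 100.0, "confidence": None},
]


def confidence_summary(entries):
    """Return the share of total cost per confidence label, including 'unallocated'."""
    total = sum(entry["cost_usd"] for entry in entries)
    if total <= 0:
        return {}
    cost_by_label = {}
    for entry in entries:
        label = entry["confidence"] or "unallocated"
        cost_by_label[label] = cost_by_label.get(label, 0.0) + entry["cost_usd"]
    return {label: cost / total for label, cost in cost_by_label.items()}


summary = confidence_summary(ledger)
```

Tracking the `unallocated` share over time turns it into a metric a team can drive down, instead of a remainder that disappears into other teams' charges.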

| Cost surface | Direct usage signal | Useful allocation dimension | Optimization question |
| --- | --- | --- | --- |
| Model inference | Input, cached, output, reasoning tokens, calls, model, tier. | Tenant, feature, task, agent, prompt version, outcome. | Which model and prompt produce value at acceptable quality? |
| Embeddings | Input tokens, documents, refresh frequency, batch jobs. | Corpus, product, tenant, pipeline, data owner. | Are unchanged or unused documents being re-embedded? |
| Vector database | Stored vectors, query units, replicas, index builds, transfer. | Corpus, tenant, feature, environment. | Which indexes and replicas produce useful retrieval? |
| GPU and serving | GPU-seconds, memory, active sequences, batch fill, idle time. | Model, workload class, tenant share, endpoint. | Is shared capacity allocated and utilized fairly? |
| Storage and egress | Bytes stored, retained, read, written, and transferred. | Dataset, artifact, tenant, region, retention policy. | Which data is duplicated, stale, or crossing regions unnecessarily? |
| Scheduled jobs and evals | Runs, model calls, dataset rows, duration, failures, retries. | Model version, experiment, owner, release decision. | Which recurring jobs no longer influence a decision? |

What I would build

I would build a cost telemetry pipeline that enriches every AI operation with application dimensions before exporting traces and usage. Provider cost exports, FOCUS-normalized billing data, model usage APIs, Kubernetes metrics, vector database usage, and storage reports would join into a cost ledger.

The dashboard would let a product owner move from monthly spend to feature, tenant, workflow, model call, and outcome. It would highlight unallocated spend, low-confidence allocations, cost anomalies, idle shared capacity, stale scheduled jobs, and features whose cost grows faster than their useful outcomes.
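
The last check, features whose cost grows faster than their useful outcomes, reduces to comparing two growth ratios. The sketch below assumes per-period totals are already available from the ledger. The function names and figures are illustrative.

```python
from typing import Optional


def cost_per_outcome(cost_usd: float, outcomes: int) -> Optional[float]:
    """Unit cost of a useful outcome; None when there were no outcomes."""
    return cost_usd / outcomes if outcomes > 0 else None


def cost_outpaces_outcomes(prev: dict, curr: dict) -> bool:
    """True when cost grew by a larger factor than useful outcomes between two periods."""
    if prev["cost_usd"] <= 0 or prev["outcomes"] <= 0:
        raise ValueError("previous period needs positive cost and outcomes")
    cost_growth = curr["cost_usd"] / prev["cost_usd"]
    outcome_growth = curr["outcomes"] / prev["outcomes"]
    return cost_growth > outcome_growth


# Illustrative monthly totals for one feature.
last_month = {"cost_usd": 1000.0, "outcomes": 500}
this_month = {"cost_usd": 1500.0, "outcomes": 600}

flagged = cost_outpaces_outcomes(last_month, this_month)
unit_before = cost_per_outcome(last_month["cost_usd"], last_month["outcomes"])
unit_after = cost_per_outcome(this_month["cost_usd"], this_month["outcomes"])
```

Here cost grows by a factor of 1.5 while outcomes grow by 1.2, so the feature is flagged and its cost per outcome rises from 2.00 to 2.50 USD.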

The design principle

Cost observability is complete only when the person who can change the system can see the cost they influence and the value it creates. Allocate cloud resources to workloads, workloads to product behavior, and product behavior to outcomes. Anything left as “shared AI cost” is an engineering question still unanswered.