We reviewed the total cost of ownership for forty-seven production LLM deployments we've either built or advised on since mid-2023. The honest conclusion: about a third never pay back, a third break even only after aggressive optimization, and the remaining third deliver the kind of returns that industry pitch decks suggest everyone is getting. The difference is not model choice. It is almost entirely in the design of the use case.
Why we wrote this
Enterprise buyers are making multi-year commitments to LLM infrastructure based on cost assumptions that rarely survive contact with production traffic. Vendor pricing examples assume small inputs, short outputs, and one-shot inference — none of which describe what actual enterprise workloads look like. Meanwhile, the cost of "doing it right" — prompt engineering, evaluation harnesses, guardrails, monitoring, fallback paths — often exceeds the raw inference cost by a factor of three.
What follows is not a theoretical framework. It is the pattern we see, repeatedly, across our engagements. If your organization is in the planning stage of an LLM deployment, the cost ranges below will be more accurate than whatever your vendor's pricing calculator is showing you.
The data
Forty-seven deployments, observed over 8-36 months each, spanning financial services, healthcare, retail, manufacturing, government, and telecom. Use cases included: document intelligence, text-to-SQL, customer service augmentation, code generation, content generation, internal knowledge retrieval, and a handful of agent-style workflows.
For each, we tracked three things: the all-in unit economics (inference + infrastructure + operations), the measurable business outcome (revenue lift, cost reduction, or efficiency gain), and the ratio between the two. We then grouped the deployments into three cohorts based on 12-month payback performance.
Where the real cost lives
Raw inference — the number you see on vendor pricing pages — averaged about 14% of total cost across our sample. The other 86% is distributed across categories that don't show up in initial business cases:
- Evaluation and quality infrastructure (21% of total cost). The harnesses, test sets, and continuous evaluation pipelines needed to detect model drift and regressions. Invariably underestimated in initial proposals.
- Guardrails and safety filtering (16%). PII detection, prompt injection defense, output moderation. For regulated industries, this alone can match the base inference cost.
- Data integration and retrieval (14%). RAG infrastructure, vector database costs, document chunking and indexing pipelines. Scales linearly with corpus size in ways that surprise finance teams.
- Human review of outputs (13%). For any use case where decisions have consequences, you are going to need humans in the loop for some percentage of outputs. This is often the single largest line item at year two.
- Application engineering (12%). The actual software that wraps the model into a usable product. LLMs are not applications; they are components.
- Monitoring and observability (10%). Logging, tracing, cost attribution, usage analytics. Without this you cannot optimize, and without optimization costs compound.
The 14% that remains for inference itself is further split roughly 60/40 between the main inference calls and the secondary calls needed for guardrail checks, classification, and fallback handling. That puts main inference at about 8% of total cost and secondary calls at about 6%.
In the initial business case we reviewed for one retail client, the 12-month LLM inference cost was projected at $340k. The actual 12-month all-in cost came in at $2.1M — a 6.2x multiple. The inference number had been exactly right. Everything else had been invisible.
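The arithmetic behind the breakdown can be sketched in a few lines of Python. The shares and the retail figures are the ones quoted above; the function names are illustrative, and the scaling assumes a deployment follows the sample-average shares, which any single deployment may not.

```python
# Cost shares quoted in the breakdown above (percent of total cost).
COST_SHARES = {
    "inference": 14,
    "evaluation_and_quality": 21,
    "guardrails_and_safety": 16,
    "data_integration_and_retrieval": 14,
    "human_review": 13,
    "application_engineering": 12,
    "monitoring_and_observability": 10,
}


def estimate_all_in_cost(inference_cost: float) -> dict:
    """Scale a raw inference estimate to an all-in cost, line by line.

    Assumption: the deployment matches the sample-average shares.
    """
    total = inference_cost * 100 / COST_SHARES["inference"]
    return {name: total * share / 100 for name, share in COST_SHARES.items()}


def split_inference(inference_cost: float) -> tuple:
    """Split inference roughly 60/40 into main calls and secondary calls."""
    return inference_cost * 0.6, inference_cost * 0.4


# The categories should account for the whole budget.
assert sum(COST_SHARES.values()) == 100

implied_multiple = 100 / COST_SHARES["inference"]
print(f"Implied all-in multiple at average shares: {implied_multiple:.1f}x")

# Retail client: $340k projected inference, $2.1M actual all-in cost.
retail_multiple = 2_100_000 / 340_000
print(f"Retail client multiple: {retail_multiple:.1f}x")

breakdown = estimate_all_in_cost(340_000)
for name, cost in breakdown.items():
    print(f"{name:32s} ${cost:>12,.0f}")
```

At the sample-average shares, a 14% inference share implies an all-in cost of roughly 7.1x inference, which is in the same range as the 6.2x the retail client actually saw.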
What separates the winners
When we cluster our sample into the three cohorts — pays back quickly, eventually, never — a handful of characteristics predict which cohort a deployment ends up in. These are the patterns we now look for when scoping engagements.
1. The use case has a clear outcome metric that moves quickly
Winning deployments are tied to metrics like closed deals, resolved tickets, approved loans, or dispatched trucks — things measurable weekly or monthly. Deployments tied to fuzzy goals ("improve employee productivity," "enhance decision-making") consistently struggle to justify themselves, because no one can agree on what "improved" means.
2. The LLM is replacing a specific, quantifiable manual step
If there is a person today spending two hours a day doing a task that the model can do in thirty seconds, the math works almost regardless of model choice. If the LLM is adding a capability the organization didn't previously have, the math has to work against a soft counterfactual, and the ROI case tends to unravel under scrutiny.
3. The expected output has a narrow shape
Structured outputs — classification labels, extracted fields, specific JSON schemas — are much cheaper to operate than open-ended text generation. They are cheaper to evaluate, cheaper to moderate, and easier to handle when they go wrong. The most reliable winners in our sample use LLMs for bounded tasks, not for open-ended ones.
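To illustrate why narrow outputs are cheap to evaluate, here is a minimal sketch of checking an extracted record against an expected shape and scoring it by exact match. The invoice schema, field names, and sample values are hypothetical and chosen only for illustration; a production system would likely use a fuller schema validator.

```python
import json

# Hypothetical schema for an invoice-extraction task: field -> allowed types.
INVOICE_SCHEMA = {
    "invoice_number": (str,),
    "total_amount": (int, float),
    "currency": (str,),
}


def validate_extraction(raw_output: str) -> list:
    """Return a list of problems; an empty list means the output passes."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]

    problems = []
    for field, allowed_types in INVOICE_SCHEMA.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], allowed_types):
            problems.append(f"wrong type for {field}")
    extra = sorted(set(data) - set(INVOICE_SCHEMA))
    if extra:
        problems.append(f"unexpected fields: {extra}")
    return problems


def field_accuracy(predicted: dict, expected: dict) -> float:
    """Share of expected fields whose extracted value matches exactly."""
    matches = sum(1 for key, value in expected.items() if predicted.get(key) == value)
    return matches / len(expected)


good = '{"invoice_number": "INV-001", "total_amount": 1250.5, "currency": "USD"}'
bad = '{"invoice_number": "INV-001", "total_amount": "a lot"}'

print(validate_extraction(good))  # []
print(validate_extraction(bad))   # wrong type for total_amount, missing currency

expected = {"invoice_number": "INV-001", "total_amount": 1250.5, "currency": "EUR"}
print(field_accuracy(json.loads(good), expected))  # 2 of 3 fields match
```

An open-ended text output has no equivalent of these checks, which is why its evaluation and moderation tend to need either human review or further model calls.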
4. The system has a functional fallback when the model is wrong
Every deployment will produce bad outputs sometimes. Winners design for this from day one: confidence thresholds that route uncertain cases to humans, cached deterministic answers for common queries, model fallbacks when the primary is unavailable. Losers treat each of these as a feature to add "after we prove the ROI." They rarely get the chance.
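The three fallback mechanisms named above can be sketched as a single routing function. This is a simplified illustration, not a reference design: the `ModelResult` type, the 0.8 threshold, the stub models, and the use of `ConnectionError` to signal an unavailable model are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ModelResult:
    answer: str
    confidence: float  # assumed to be a score in [0, 1]


@dataclass
class Routed:
    answer: Optional[str]  # None means "hand to a human reviewer"
    route: str             # "cache", "primary", "fallback", or "human"


ModelFn = Callable[[str], ModelResult]


def route_query(query: str, cache: Dict[str, str], primary: ModelFn,
                fallback: ModelFn, threshold: float = 0.8) -> Routed:
    # 1. Cached deterministic answers for common queries.
    key = query.strip().lower()
    if key in cache:
        return Routed(cache[key], "cache")

    # 2. Try the primary model; use the fallback model if it is unavailable.
    #    ConnectionError stands in for whatever a real client would raise.
    try:
        result, route = primary(query), "primary"
    except ConnectionError:
        try:
            result, route = fallback(query), "fallback"
        except ConnectionError:
            return Routed(None, "human")

    # 3. Route low-confidence outputs to a human instead of returning them.
    if result.confidence < threshold:
        return Routed(None, "human")
    return Routed(result.answer, route)


# Stub models for illustration only.
def confident_model(query: str) -> ModelResult:
    return ModelResult("Your claim was approved.", 0.93)


def unsure_model(query: str) -> ModelResult:
    return ModelResult("Possibly approved?", 0.41)


def unavailable_model(query: str) -> ModelResult:
    raise ConnectionError("model endpoint unavailable")


cache = {"what are your opening hours?": "9am to 5pm, Monday to Friday."}

print(route_query("What are your opening hours? ", cache, confident_model, unsure_model).route)  # cache
print(route_query("Status of claim 17?", cache, unavailable_model, confident_model).route)       # fallback
print(route_query("Status of claim 17?", cache, unsure_model, confident_model).route)            # human
```

Checking the cache first means the most common queries never incur an inference call, and the human route doubles as the last resort when both models are down.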
Three use cases that consistently pay back
Across our sample, three categories of use case deliver strong payback with notable consistency:
- Document-heavy back-office operations. Invoice processing, claims triage, trade-finance documentation, KYC onboarding. Replaces measurable human-hours with bounded tasks that the model handles well. Median payback: 8-11 months.
- Internal knowledge retrieval in complex domains. Legal research, clinical reference, technical support, tax. Narrow corpora, measurable deflection from human-handled queries. Median payback: 10-14 months.
- Structured extraction from unstructured text. Pulling specific fields from contracts, regulatory filings, medical records, or customer communications into structured systems. Bounded, verifiable, replaces manual data entry. Median payback: 7-10 months.
Three use cases that rarely pay back
Conversely, three categories where we have yet to see a deployment deliver positive ROI in twelve months:
- Generic internal chatbots. "Ask anything about our company" style assistants. Usage drops within weeks once the novelty wears off. Hard to attribute any business outcome.
- Open-ended content generation for marketing. The outputs need so much human revision that revising them costs more than writing from scratch, and the content still tends to be diluted and off-brand.
- Autonomous agents handling novel workflows. Still experimental. In our sample, every "agent" deployment ended up either scoped down to a narrow bounded task (which worked) or abandoned.
How to model this for your organization
If you are building a business case for an LLM deployment, our recommendation is to build your cost model from the top down, starting from realistic all-in cost rather than working up from vendor pricing. Estimate the total cost at roughly 6-8x the raw inference cost for year one, which is in line with inference averaging about 14% of total cost in our sample (roughly a 7x multiple), declining to perhaps 4-5x by year three as infrastructure matures. Then test the business case at that higher cost level before you commit. If the case falls apart at realistic cost, it was never going to work anyway.
On the benefit side, require the use case to name a specific human activity being replaced or augmented, measurable in hours-per-week or conversions-per-month. Vague goals produce vague outcomes, which produce unmeasurable ROI, which produces projects that drift into the "never pay back" bucket.
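A minimal sketch of that stress test follows. Every input is hypothetical (40 staff, 10 hours saved per person per week, a $60 loaded hourly rate, 48 working weeks, $150k of annual inference); only the 6x and 8x multiples come from the guidance above.

```python
def annual_benefit(people: int, hours_saved_per_week: float,
                   loaded_hourly_rate: float, working_weeks: int = 48) -> float:
    """Value of the named human activity being replaced, per year."""
    return people * hours_saved_per_week * loaded_hourly_rate * working_weeks


def stress_test(inference_cost: float, benefit: float,
                multiples=(1, 6, 8)) -> dict:
    """Benefit-to-cost ratio at each all-in multiple of raw inference cost."""
    return {m: benefit / (inference_cost * m) for m in multiples}


# Hypothetical inputs, for illustration only.
benefit = annual_benefit(people=40, hours_saved_per_week=10, loaded_hourly_rate=60.0)
ratios = stress_test(inference_cost=150_000, benefit=benefit)

for multiple, ratio in ratios.items():
    verdict = "covers cost" if ratio >= 1 else "does not cover cost"
    print(f"{multiple}x inference: benefit/cost = {ratio:.2f} ({verdict})")
```

With these made-up inputs the case looks overwhelming at vendor pricing (7.68), is thin at 6x (1.28), and fails at 8x (0.96). A case that only works at the 1x row is the kind that tends to drift into the "never pay back" bucket.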
None of this is an argument against deploying LLMs. Thirty-five percent of the deployments we studied paid back within a year — some spectacularly. What our sample shows is that the winners were doing something specific, not something general. Find the specific thing in your organization. Start there.