Text-to-SQL in the Wild: 200 Production Deployments

Text-to-SQL — the ability to ask a database a question in plain language and get a correct answer — is the AI demo that converts executives faster than any other. It is also the one most likely to create a follow-up conversation six months later, in which the same executives quietly ask what happened to the deployment. We've reviewed two hundred such deployments over the last two years. Here is what separates the ones that are still in production from the ones that are not.

Why this is hard

On the surface, text-to-SQL looks like a well-defined translation task. A user asks "what was our revenue in Q3 by region?" The system emits SELECT region, SUM(revenue) FROM sales WHERE quarter='Q3' GROUP BY region, runs it, returns a table. Demo-able in twenty minutes.

The reason it's hard in production is that nothing about that example holds up. Real enterprise databases are not well-modeled; they are eighteen years of accumulated pragmatism with archaeology of naming conventions, table merges that were never cleaned up, and business rules encoded in comments nobody reads. Real users do not ask "what was our revenue in Q3" — they ask "how did we do last quarter," which requires the system to know which fiscal calendar the organization uses, what "we" means in the user's role context, and what counts as "doing well."

Production text-to-SQL has to succeed across all of this, reliably, while also not producing an answer so wrong that it misleads a decision-maker. In our dataset, the systems that are still live share a specific set of design choices. The ones that aren't, usually skipped the same ones.

The cohorts

We grouped our 200-deployment sample into three cohorts based on their status 12 months after go-live:

28%

Still in active production

22%

Running but declining usage

50%

Decommissioned or abandoned

84%

Survivors used semantic layers

The headline is that half of text-to-SQL deployments are gone a year later. Not failed catastrophically — usually just quietly deprioritized, with users drifting back to whatever report or analyst they were relying on before.

Failure modes

Across the decommissioned 50%, the failures clustered into a small number of recurring patterns.

Failure mode 1: The schema is a liability

The most common failure. The organization connected the text-to-SQL system directly to a production database whose schema had evolved organically over years. Tables had names like CUST_PROC_TMP_V2. Date columns in some tables were stored as strings. The same business concept — "customer" — existed in six tables with slightly different semantics. The model, given this raw schema, hallucinated plausible but subtly wrong queries with unsettling regularity.

Failure mode 2: No definition of right

The organization never specified what "correct" meant for their business. Ask any large organization what "active customer" means and you will get four different answers from four different teams. The text-to-SQL system had to pick one. It picked the wrong one often enough to erode trust, and there was no governance framework to override its choice.

Failure mode 3: Unbounded scope

The system was pitched to users as "ask anything," which set expectations no implementation could meet. Users asked about data the system didn't have access to, about business concepts nobody had ever formalized, about hypotheticals that required simulation rather than query. The resulting error rate, measured in user frustration, exceeded the tolerance threshold.

Failure mode 4: No answer confidence

The system returned every query with equal confidence. A straightforward lookup got the same presentation as a multi-table join over ambiguous columns with a Boolean filter the model half-guessed. Users had no way to know which answers to trust, so they learned — correctly — not to trust any of them.

Failure mode 5: Read-only was not really read-only

A handful of deployments allowed the model to generate queries against a live transactional database, against advice. Predictably, at least one query eventually locked a production table during business hours. Most of these deployments were shut down that same week.

What the survivors had in common

Across the 28% still in active production, a different set of patterns emerged. These are the patterns we now treat as preconditions for any new deployment.

1. A semantic layer between model and database

The single most predictive characteristic. Survivors did not expose the raw schema to the model. They exposed a curated semantic layer — a set of well-defined business concepts with human-readable names, clear definitions, and governed mappings to the underlying physical schema.

The semantic layer turned text-to-SQL from "generate arbitrary queries against an organic schema" into "compose validated operations against a small set of curated entities." Accuracy rose dramatically. Hallucinations largely disappeared.

Key insight

Eighty-four percent of the deployments still in production a year later used some form of semantic layer. Of the decommissioned deployments, only 11% did. The semantic layer is not an optimization. It is effectively a precondition for sustained text-to-SQL deployment.

2. Query validation before execution

Survivors ran every generated query through a validation step before execution. The validator checked for dangerous patterns (unbounded table scans, cartesian products, operations against write-protected tables), estimated cost, and blocked queries that exceeded cost thresholds. The user saw a polite "this query is too expensive, please narrow your question" message instead of a crashed database.

3. Confidence-aware presentation

Survivors ranked their own answers. Some answers were presented with high confidence ("here is the exact number"). Some were presented with caveats ("this is based on the sales table, not the finalized_revenue table — confirm with finance for reporting purposes"). Some were not answered at all, and the user was directed to a human. This graded response matched user expectations much better than a single confident-looking answer for everything.

4. Constrained scope, expanding over time

Survivors did not launch with "ask anything." They launched with a specific, curated set of question patterns that the system answered well, and added patterns based on observed user demand. The growth pattern was usage-driven, which meant every new capability was validated against a real need before shipping.

5. Measurable feedback loops

Survivors logged every user interaction: query asked, SQL generated, result returned, user action after (download, share, re-ask, correct). This instrumentation fed a continuous improvement loop that let the team address the specific questions users were actually asking instead of ones the team had imagined.

The governance question

A specific operational concern comes up repeatedly on text-to-SQL deployments: what happens when a user, asking a perfectly reasonable question, receives an answer that is formally correct but materially misleading? For instance, asking "what's our profit margin on product X" and getting back a number that is mathematically derived from the data but doesn't match what finance reports — because the system has quietly picked a different definition of "margin" than finance uses.

This is not a technical problem. It is a governance problem. The survivors in our sample had explicit policies: definitional disagreements were resolved in favor of finance, HR, or compliance depending on the domain. The semantic layer was governed by the same people who governed the equivalent definitions in regulated reports. Text-to-SQL was, in effect, a specific lens onto the organization's already-governed data — not a new source of answers. Systems that tried to be a new source of answers created their own parallel reality, which leadership eventually found intolerable.

What this means for you

If you are evaluating a text-to-SQL deployment, three questions will tell you most of what you need to know:

Is there a semantic layer between the model and the database, and who owns its definitions?
Does the system rank its own confidence, and does the user interface show it?
Is the initial scope narrow and well-defined, with a plan to expand based on measured demand?

If the answer to any of these is unclear, the probability your deployment is in the decommissioned 50% a year from now is uncomfortably high. None of this is unfixable — but fixing it after deployment is much harder than building it in from the start.

Text-to-SQL in the Wild: What 200 Production Deployments Taught Us.

Why this is hard

The cohorts

Failure modes

Failure mode 1: The schema is a liability

Failure mode 2: No definition of right

Failure mode 3: Unbounded scope

Failure mode 4: No answer confidence

Failure mode 5: Read-only was not really read-only

What the survivors had in common

1. A semantic layer between model and database

2. Query validation before execution

3. Confidence-aware presentation

4. Constrained scope, expanding over time

5. Measurable feedback loops

The governance question

What this means for you

Audit Your Text-to-SQL Readiness.

Text-to-SQL in the Wild: What 200 Production Deployments Taught Us.

Why this is hard

The cohorts

Failure modes

Failure mode 1: The schema is a liability

Failure mode 2: No definition of right

Failure mode 3: Unbounded scope

Failure mode 4: No answer confidence

Failure mode 5: Read-only was not really read-only

What the survivors had in common

1. A semantic layer between model and database

2. Query validation before execution

3. Confidence-aware presentation

4. Constrained scope, expanding over time

5. Measurable feedback loops

The governance question

What this means for you

Audit Your Text-to-SQL Readiness.

The Hidden Cost Curve of LLM Deployments

AI Governance Is Data Lineage

AI Rewriting Legacy Banking Cores