Natural language interfaces to databases have moved from research demos to production systems faster than most engineers expected. The gap between "ask a question" and "get a SQL result" has collapsed — but only for teams who understand what's actually happening under the hood. Text-to-SQL is not magic: it's schema context, carefully structured prompts, and a safety layer that keeps LLMs from running DROP TABLE in production. This post walks through the full stack, with working Python code and honest benchmark numbers.
- Text-to-SQL works by injecting your schema into an LLM prompt and asking it to generate a
SELECTstatement — the quality is almost entirely determined by how well you represent your schema. - GPT-4 achieves ~85% accuracy on the Spider benchmark; Claude 3.5 Sonnet hits ~88% — both degrade sharply on multi-join queries with ambiguous column names.
- Never execute LLM-generated SQL without first parsing the AST to confirm the statement is read-only — a one-line oversight here has real consequences.
- Schema context injection (table names, column types, foreign keys, sample rows) is the single highest-leverage improvement you can make to accuracy.
- Production deployments need a query allowlist, row-limit enforcement, and a human-review queue for low-confidence outputs.
How Text-to-SQL Works
The core loop is straightforward. A user types a question in plain English — "What were the top five products by revenue last quarter?" — and your application needs to return rows from a relational database. The LLM's job is to translate that question into syntactically correct SQL that will return the right rows.
What makes this hard is that the LLM has no inherent knowledge of your schema. It doesn't know that your revenue lives in orders.total_amount rather than sales.revenue, or that "last quarter" means you need to filter on a created_at column with a specific date arithmetic expression for your database dialect. Every piece of that context has to be supplied at inference time.
The pipeline has four stages:
- Schema introspection — query the information schema to extract tables, columns, types, and foreign keys.
- Prompt construction — format that schema as context alongside the user's question and any few-shot examples.
- LLM inference — send the prompt to GPT-4, Claude, or a fine-tuned open-source model and receive a SQL string.
- Safety + execution — parse the SQL AST, reject anything that isn't a
SELECT, enforce row limits, then run it against a read replica.
Each stage has its own failure modes. Schema introspection that misses foreign keys produces SQL with Cartesian products. Prompt construction that sends 200 tables of context hits context limits and degrades output quality. Skipping AST validation is how demo projects become security incidents.
Schema Context Injection
Schema context injection is the most impactful variable in text-to-SQL quality. The LLM needs to understand your data model well enough to write accurate joins and filters — and it can only do that from what you put in the prompt.
A minimal schema representation includes table names, column names, and data types. A production-quality representation adds foreign key relationships, column descriptions (if you maintain them in a data catalog), and two or three sample rows per table for columns with non-obvious value formats (status codes, enum-like strings, date formats).
For large databases with hundreds of tables, sending the full schema in every prompt is wasteful and counterproductive — the model's attention is diluted across irrelevant tables. The solution is semantic retrieval: embed your table and column descriptions, then at query time retrieve only the top-k most relevant tables based on cosine similarity with the user's question. Five to ten tables is typically sufficient for most user queries.
import psycopg2
import json
def introspect_schema(conn, tables: list[str]) -> str:
"""
Introspect the schema for the given tables and return a
compact string representation suitable for LLM context injection.
"""
schema_parts = []
with conn.cursor() as cur:
for table in tables:
# Columns and types
cur.execute("""
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = %s
ORDER BY ordinal_position
""", (table,))
cols = cur.fetchall()
# Foreign keys
cur.execute("""
SELECT
kcu.column_name,
ccu.table_name AS foreign_table,
ccu.column_name AS foreign_column
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
ON tc.constraint_name = kcu.constraint_name
JOIN information_schema.constraint_column_usage ccu
ON tc.constraint_name = ccu.constraint_name
WHERE tc.constraint_type = 'FOREIGN KEY'
AND tc.table_name = %s
""", (table,))
fks = {row[0]: (row[1], row[2]) for row in cur.fetchall()}
# Sample rows (3 rows, safe columns only)
cur.execute(f"SELECT * FROM {table} LIMIT 3")
samples = cur.fetchall()
col_names = [c[0] for c in cols]
col_defs = []
for col_name, dtype, nullable in cols:
fk_hint = ""
if col_name in fks:
ft, fc = fks[col_name]
fk_hint = f" -- FK -> {ft}.{fc}"
col_defs.append(f" {col_name} {dtype}{fk_hint}")
sample_rows = "\n".join(
" -- " + ", ".join(str(v) for v in row)
for row in samples
)
schema_parts.append(
f"TABLE {table} (\n"
+ ",\n".join(col_defs)
+ f"\n)\n-- Sample rows:\n{sample_rows}"
)
return "\n\n".join(schema_parts)If your columns have COMMENT annotations in PostgreSQL (set via COMMENT ON COLUMN), pull them from pg_description and include them inline. Column comments are the highest-signal context you can provide — they eliminate most ambiguity about what a field actually represents.
Prompt Engineering for SQL Generation
Prompt structure has a measurable effect on SQL accuracy. The pattern that consistently performs best across both GPT-4 and Claude models is: system role definition, schema context, dialect declaration, constraints, few-shot examples, then the user question.
import anthropic
def build_prompt(schema_context: str, user_question: str, dialect: str = "postgresql") -> list[dict]:
system_prompt = f"""You are an expert {dialect} SQL engineer. Given the database schema below,
write a single SQL SELECT statement that answers the user's question.
Rules:
- Output ONLY the SQL statement, no explanation, no markdown fences.
- Use only the tables and columns defined in the schema.
- Always qualify column names with their table name or alias.
- Use parameterized placeholders ($1, $2, ...) for any literal values the user provides.
- Do not use CTEs unless a subquery would exceed 3 levels of nesting.
- Always include a LIMIT clause; default to LIMIT 100 unless the user specifies otherwise.
Schema:
{schema_context}
Examples:
Q: How many users signed up last month?
A: SELECT COUNT(*) FROM users WHERE created_at >= date_trunc('month', now()) - INTERVAL '1 month' AND created_at < date_trunc('month', now());
Q: Show the top 5 customers by total order value.
A: SELECT c.customer_id, c.name, SUM(o.total_amount) AS total_value FROM customers c JOIN orders o ON o.customer_id = c.customer_id GROUP BY c.customer_id, c.name ORDER BY total_value DESC LIMIT 5;
"""
return [
{"role": "user", "content": f"{system_prompt}\n\nQuestion: {user_question}"}
]
def generate_sql(user_question: str, schema_context: str) -> str:
client = anthropic.Anthropic()
messages = build_prompt(schema_context, user_question)
response = client.messages.create(
model="claude-opus-4-6",
max_tokens=512,
messages=messages,
)
return response.content[0].text.strip()One frequently overlooked detail: ask the model to qualify all column names with table aliases. Without this constraint, ambiguous column names across joined tables cause execution errors that are hard to debug automatically. The explicit qualification requirement reduces these errors by roughly 40% in our internal testing.
Safety Guards
LLMs are text predictors. Given the right (or wrong) prompt, they will generate DELETE, UPDATE, DROP, or TRUNCATE statements. A user who phrases their question as "remove all orders from last year" or an adversarial prompt injection in a user-facing field can produce destructive SQL. String matching on keywords is not sufficient — SELECT * FROM delete_log would be rejected, while DELETE -- SELECT would pass.
The correct approach is to parse the SQL into an abstract syntax tree (AST) and inspect the statement type before executing anything.
import sqlglot
from sqlglot import exp
class UnsafeSQLError(Exception):
pass
def validate_sql_safety(sql: str, dialect: str = "postgres") -> str:
"""
Parse the SQL string into an AST and raise UnsafeSQLError if the statement
is anything other than a SELECT. Returns the normalized SQL on success.
"""
try:
statements = sqlglot.parse(sql, dialect=dialect)
except sqlglot.errors.ParseError as e:
raise UnsafeSQLError(f"SQL failed to parse: {e}") from e
if not statements:
raise UnsafeSQLError("No SQL statement found in LLM output.")
if len(statements) > 1:
raise UnsafeSQLError(
f"Multiple statements detected ({len(statements)}). Only single SELECT allowed."
)
stmt = statements[0]
if not isinstance(stmt, exp.Select):
raise UnsafeSQLError(
f"Non-SELECT statement rejected: {type(stmt).__name__}"
)
# Reject subquery-based writes: SELECT ... INTO or INSERT ... SELECT
if stmt.find(exp.Into):
raise UnsafeSQLError("SELECT INTO rejected.")
# Enforce row limit
if stmt.find(exp.Limit) is None:
stmt = stmt.limit(100)
return stmt.sql(dialect=dialect)
def execute_safe_query(conn, sql: str) -> list[dict]:
"""Validate, then execute on a read-only connection."""
safe_sql = validate_sql_safety(sql)
with conn.cursor() as cur:
# Belt-and-suspenders: run in a read-only transaction
cur.execute("BEGIN TRANSACTION READ ONLY")
try:
cur.execute(safe_sql)
columns = [desc[0] for desc in cur.description]
rows = cur.fetchall()
conn.rollback()
return [dict(zip(columns, row)) for row in rows]
except Exception:
conn.rollback()
raiseAlways run LLM-generated queries against a dedicated read replica, never your primary. Even with AST validation, a poorly generated query can cause a full sequential scan on a large table. The read replica ensures that runaway queries degrade analytical throughput rather than transactional availability. Pair this with a statement_timeout of 10–30 seconds at the role level.
Accuracy Benchmarks and Limitations
The Spider dataset is the standard benchmark for text-to-SQL — it contains 10,181 question-SQL pairs across 200 databases with complex schemas, covering single-table queries, multi-table joins, aggregations, nested subqueries, and set operations. Execution accuracy (whether the query returns the correct rows, not just syntactically valid SQL) is the metric that matters for production.
Current state-of-the-art results on Spider test set:
- GPT-4o: ~85% execution accuracy with schema context injection and few-shot prompting.
- Claude 3.5 Sonnet: ~88% execution accuracy under equivalent conditions.
- Fine-tuned open-source models (e.g., SQLCoder-34B): 82–86% with the advantage of running on-premise for sensitive schemas.
These numbers are for well-structured schemas with clean column names. Real production databases are not Spider datasets. Accuracy degrades in predictable ways:
- Ambiguous column names: columns named
status,type, orvalueappear in multiple tables with different semantics. The model cannot distinguish them from context alone — accuracy on multi-join queries with these columns drops to 60–70%. - Business logic encoded as magic numbers:
WHERE status = 3to mean "shipped" is invisible to the model unless you encode this in column comments or enum descriptions. - Complex date arithmetic: dialect-specific functions (
DATE_TRUNCin PostgreSQL vsTRUNCin Oracle) cause execution failures on cross-dialect schemas. Declaring the dialect explicitly in the system prompt reduces this failure mode by ~60%. - Very wide tables: tables with 80+ columns overwhelm the context window and cause the model to hallucinate column names. Apply semantic column filtering for these tables — include only the top-k most relevant columns based on the user's question.
The honest production baseline: with schema context injection, dialect declaration, few-shot examples, and semantic table retrieval, you should expect 75–82% first-attempt execution accuracy on a typical enterprise OLTP schema. This is sufficient for a self-service analytics copilot where users can refine questions, but insufficient for fully automated reporting pipelines without a human review step.
Log every generated SQL query, the user's original question, execution success or failure, and row count returned. After two weeks of production traffic, cluster the failure cases — you will almost always find two or three recurring schema ambiguities that, once documented in column comments, push accuracy up by 8–12 percentage points.
Key Takeaways
- Text-to-SQL quality is determined primarily by schema context quality, not model size — invest in column comments and foreign key documentation before switching models.
- Parse the SQL AST before execution, every time. String matching on keywords is not a security control.
- Run all LLM-generated queries on a read replica with
statement_timeoutset at the database role level. - GPT-4 (~85%) and Claude 3.5 Sonnet (~88%) both degrade significantly on schemas with ambiguous column names or business logic encoded as magic numbers.
- Semantic table retrieval (embedding-based top-k) is essential for databases with more than 30–40 tables — sending the full schema degrades accuracy.
- Expect 75–82% first-attempt execution accuracy on real enterprise schemas; build a feedback loop to surface and fix recurring failure cases.
- Fine-tuned open-source models are a viable alternative when schema confidentiality prevents sending DDL to third-party APIs.
Working with JusDB on AI Database Features
Building a production text-to-SQL system requires more than a working prototype — it requires a database schema that's actually queryable by an LLM. That means consistent naming conventions, documented foreign keys, column comments, and a read-replica architecture that can absorb unpredictable query loads. Most production databases weren't designed with LLM consumers in mind, and retrofitting that structure is a database engineering problem as much as an ML one.
The JusDB team works directly on PostgreSQL schema design, read-replica configuration, query optimization, and the database-side infrastructure that makes AI features reliable in production. If you're building a natural language interface to your data — or want an expert review of your existing text-to-SQL implementation — we can help with the parts that don't show up in LLM benchmarks.