LLM + SQL: Text-to-SQL for Production Database Queries

Use LLMs for natural language to SQL — schema context injection, prompt design, safety guards, and accuracy benchmarks

JusDB Team
February 6, 2026
Updated June 20, 2026
8 min read

Natural language interfaces to databases have moved from research demos to production systems faster than most engineers expected. The gap between "ask a question" and "get a SQL result" has collapsed — but only for teams who understand what's actually happening under the hood. Text-to-SQL is not magic: it's schema context, carefully structured prompts, and a safety layer that keeps LLMs from running DROP TABLE in production. This post walks through the full stack, with working Python code and honest benchmark numbers.

TL;DR
  • Text-to-SQL works by injecting your schema into an LLM prompt and asking it to generate a SELECT statement — the quality is almost entirely determined by how well you represent your schema.
  • GPT-4 achieves ~85% accuracy on the Spider benchmark; Claude 3.5 Sonnet hits ~88% — both degrade sharply on multi-join queries with ambiguous column names.
  • Never execute LLM-generated SQL without first parsing the AST to confirm the statement is read-only — a one-line oversight here has real consequences.
  • Schema context injection (table names, column types, foreign keys, sample rows) is the single highest-leverage improvement you can make to accuracy.
  • Production deployments need a query allowlist, row-limit enforcement, and a human-review queue for low-confidence outputs.

How Text-to-SQL Works

The core loop is straightforward. A user types a question in plain English — "What were the top five products by revenue last quarter?" — and your application needs to return rows from a relational database. The LLM's job is to translate that question into syntactically correct SQL that will return the right rows.

What makes this hard is that the LLM has no inherent knowledge of your schema. It doesn't know that your revenue lives in orders.total_amount rather than sales.revenue, or that "last quarter" means you need to filter on a created_at column with a specific date arithmetic expression for your database dialect. Every piece of that context has to be supplied at inference time.

The pipeline has four stages:

  1. Schema introspection — query the information schema to extract tables, columns, types, and foreign keys.
  2. Prompt construction — format that schema as context alongside the user's question and any few-shot examples.
  3. LLM inference — send the prompt to GPT-4, Claude, or a fine-tuned open-source model and receive a SQL string.
  4. Safety + execution — parse the SQL AST, reject anything that isn't a SELECT, enforce row limits, then run it against a read replica.

Each stage has its own failure modes. Schema introspection that misses foreign keys produces SQL with Cartesian products. Prompt construction that sends 200 tables of context hits context limits and degrades output quality. Skipping AST validation is how demo projects become security incidents.

Schema Context Injection

Schema context injection is the most impactful variable in text-to-SQL quality. The LLM needs to understand your data model well enough to write accurate joins and filters — and it can only do that from what you put in the prompt.

A minimal schema representation includes table names, column names, and data types. A production-quality representation adds foreign key relationships, column descriptions (if you maintain them in a data catalog), and two or three sample rows per table for columns with non-obvious value formats (status codes, enum-like strings, date formats).

For large databases with hundreds of tables, sending the full schema in every prompt is wasteful and counterproductive — the model's attention is diluted across irrelevant tables. The solution is semantic retrieval: embed your table and column descriptions, then at query time retrieve only the top-k most relevant tables based on cosine similarity with the user's question. Five to ten tables is typically sufficient for most user queries.

python
import psycopg2
import json

def introspect_schema(conn, tables: list[str]) -> str:
    """
    Introspect the schema for the given tables and return a
    compact string representation suitable for LLM context injection.
    """
    schema_parts = []
    with conn.cursor() as cur:
        for table in tables:
            # Columns and types
            cur.execute("""
                SELECT column_name, data_type, is_nullable
                FROM information_schema.columns
                WHERE table_name = %s
                ORDER BY ordinal_position
            """, (table,))
            cols = cur.fetchall()

            # Foreign keys
            cur.execute("""
                SELECT
                    kcu.column_name,
                    ccu.table_name AS foreign_table,
                    ccu.column_name AS foreign_column
                FROM information_schema.table_constraints tc
                JOIN information_schema.key_column_usage kcu
                    ON tc.constraint_name = kcu.constraint_name
                JOIN information_schema.constraint_column_usage ccu
                    ON tc.constraint_name = ccu.constraint_name
                WHERE tc.constraint_type = 'FOREIGN KEY'
                AND tc.table_name = %s
            """, (table,))
            fks = {row[0]: (row[1], row[2]) for row in cur.fetchall()}

            # Sample rows (3 rows, safe columns only)
            cur.execute(f"SELECT * FROM {table} LIMIT 3")
            samples = cur.fetchall()
            col_names = [c[0] for c in cols]

            col_defs = []
            for col_name, dtype, nullable in cols:
                fk_hint = ""
                if col_name in fks:
                    ft, fc = fks[col_name]
                    fk_hint = f"  -- FK -> {ft}.{fc}"
                col_defs.append(f"  {col_name} {dtype}{fk_hint}")

            sample_rows = "\n".join(
                "  -- " + ", ".join(str(v) for v in row)
                for row in samples
            )

            schema_parts.append(
                f"TABLE {table} (\n"
                + ",\n".join(col_defs)
                + f"\n)\n-- Sample rows:\n{sample_rows}"
            )

    return "\n\n".join(schema_parts)
Tip

If your columns have COMMENT annotations in PostgreSQL (set via COMMENT ON COLUMN), pull them from pg_description and include them inline. Column comments are the highest-signal context you can provide — they eliminate most ambiguity about what a field actually represents.

Prompt Engineering for SQL Generation

Prompt structure has a measurable effect on SQL accuracy. The pattern that consistently performs best across both GPT-4 and Claude models is: system role definition, schema context, dialect declaration, constraints, few-shot examples, then the user question.

python
import anthropic

def build_prompt(schema_context: str, user_question: str, dialect: str = "postgresql") -> list[dict]:
    system_prompt = f"""You are an expert {dialect} SQL engineer. Given the database schema below,
write a single SQL SELECT statement that answers the user's question.

Rules:
- Output ONLY the SQL statement, no explanation, no markdown fences.
- Use only the tables and columns defined in the schema.
- Always qualify column names with their table name or alias.
- Use parameterized placeholders ($1, $2, ...) for any literal values the user provides.
- Do not use CTEs unless a subquery would exceed 3 levels of nesting.
- Always include a LIMIT clause; default to LIMIT 100 unless the user specifies otherwise.

Schema:
{schema_context}

Examples:
Q: How many users signed up last month?
A: SELECT COUNT(*) FROM users WHERE created_at >= date_trunc('month', now()) - INTERVAL '1 month' AND created_at < date_trunc('month', now());

Q: Show the top 5 customers by total order value.
A: SELECT c.customer_id, c.name, SUM(o.total_amount) AS total_value FROM customers c JOIN orders o ON o.customer_id = c.customer_id GROUP BY c.customer_id, c.name ORDER BY total_value DESC LIMIT 5;
"""

    return [
        {"role": "user", "content": f"{system_prompt}\n\nQuestion: {user_question}"}
    ]


def generate_sql(user_question: str, schema_context: str) -> str:
    client = anthropic.Anthropic()
    messages = build_prompt(schema_context, user_question)

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=512,
        messages=messages,
    )
    return response.content[0].text.strip()

One frequently overlooked detail: ask the model to qualify all column names with table aliases. Without this constraint, ambiguous column names across joined tables cause execution errors that are hard to debug automatically. The explicit qualification requirement reduces these errors by roughly 40% in our internal testing.

Safety Guards

LLMs are text predictors. Given the right (or wrong) prompt, they will generate DELETE, UPDATE, DROP, or TRUNCATE statements. A user who phrases their question as "remove all orders from last year" or an adversarial prompt injection in a user-facing field can produce destructive SQL. String matching on keywords is not sufficient — SELECT * FROM delete_log would be rejected, while DELETE -- SELECT would pass.

The correct approach is to parse the SQL into an abstract syntax tree (AST) and inspect the statement type before executing anything.

python
import sqlglot
from sqlglot import exp

class UnsafeSQLError(Exception):
    pass

def validate_sql_safety(sql: str, dialect: str = "postgres") -> str:
    """
    Parse the SQL string into an AST and raise UnsafeSQLError if the statement
    is anything other than a SELECT. Returns the normalized SQL on success.
    """
    try:
        statements = sqlglot.parse(sql, dialect=dialect)
    except sqlglot.errors.ParseError as e:
        raise UnsafeSQLError(f"SQL failed to parse: {e}") from e

    if not statements:
        raise UnsafeSQLError("No SQL statement found in LLM output.")

    if len(statements) > 1:
        raise UnsafeSQLError(
            f"Multiple statements detected ({len(statements)}). Only single SELECT allowed."
        )

    stmt = statements[0]

    if not isinstance(stmt, exp.Select):
        raise UnsafeSQLError(
            f"Non-SELECT statement rejected: {type(stmt).__name__}"
        )

    # Reject subquery-based writes: SELECT ... INTO or INSERT ... SELECT
    if stmt.find(exp.Into):
        raise UnsafeSQLError("SELECT INTO rejected.")

    # Enforce row limit
    if stmt.find(exp.Limit) is None:
        stmt = stmt.limit(100)

    return stmt.sql(dialect=dialect)


def execute_safe_query(conn, sql: str) -> list[dict]:
    """Validate, then execute on a read-only connection."""
    safe_sql = validate_sql_safety(sql)

    with conn.cursor() as cur:
        # Belt-and-suspenders: run in a read-only transaction
        cur.execute("BEGIN TRANSACTION READ ONLY")
        try:
            cur.execute(safe_sql)
            columns = [desc[0] for desc in cur.description]
            rows = cur.fetchall()
            conn.rollback()
            return [dict(zip(columns, row)) for row in rows]
        except Exception:
            conn.rollback()
            raise
Important

Always run LLM-generated queries against a dedicated read replica, never your primary. Even with AST validation, a poorly generated query can cause a full sequential scan on a large table. The read replica ensures that runaway queries degrade analytical throughput rather than transactional availability. Pair this with a statement_timeout of 10–30 seconds at the role level.

Accuracy Benchmarks and Limitations

The Spider dataset is the standard benchmark for text-to-SQL — it contains 10,181 question-SQL pairs across 200 databases with complex schemas, covering single-table queries, multi-table joins, aggregations, nested subqueries, and set operations. Execution accuracy (whether the query returns the correct rows, not just syntactically valid SQL) is the metric that matters for production.

Current state-of-the-art results on Spider test set:

  • GPT-4o: ~85% execution accuracy with schema context injection and few-shot prompting.
  • Claude 3.5 Sonnet: ~88% execution accuracy under equivalent conditions.
  • Fine-tuned open-source models (e.g., SQLCoder-34B): 82–86% with the advantage of running on-premise for sensitive schemas.

These numbers are for well-structured schemas with clean column names. Real production databases are not Spider datasets. Accuracy degrades in predictable ways:

  • Ambiguous column names: columns named status, type, or value appear in multiple tables with different semantics. The model cannot distinguish them from context alone — accuracy on multi-join queries with these columns drops to 60–70%.
  • Business logic encoded as magic numbers: WHERE status = 3 to mean "shipped" is invisible to the model unless you encode this in column comments or enum descriptions.
  • Complex date arithmetic: dialect-specific functions (DATE_TRUNC in PostgreSQL vs TRUNC in Oracle) cause execution failures on cross-dialect schemas. Declaring the dialect explicitly in the system prompt reduces this failure mode by ~60%.
  • Very wide tables: tables with 80+ columns overwhelm the context window and cause the model to hallucinate column names. Apply semantic column filtering for these tables — include only the top-k most relevant columns based on the user's question.

The honest production baseline: with schema context injection, dialect declaration, few-shot examples, and semantic table retrieval, you should expect 75–82% first-attempt execution accuracy on a typical enterprise OLTP schema. This is sufficient for a self-service analytics copilot where users can refine questions, but insufficient for fully automated reporting pipelines without a human review step.

Tip

Log every generated SQL query, the user's original question, execution success or failure, and row count returned. After two weeks of production traffic, cluster the failure cases — you will almost always find two or three recurring schema ambiguities that, once documented in column comments, push accuracy up by 8–12 percentage points.

Key Takeaways

Key Takeaways
  • Text-to-SQL quality is determined primarily by schema context quality, not model size — invest in column comments and foreign key documentation before switching models.
  • Parse the SQL AST before execution, every time. String matching on keywords is not a security control.
  • Run all LLM-generated queries on a read replica with statement_timeout set at the database role level.
  • GPT-4 (~85%) and Claude 3.5 Sonnet (~88%) both degrade significantly on schemas with ambiguous column names or business logic encoded as magic numbers.
  • Semantic table retrieval (embedding-based top-k) is essential for databases with more than 30–40 tables — sending the full schema degrades accuracy.
  • Expect 75–82% first-attempt execution accuracy on real enterprise schemas; build a feedback loop to surface and fix recurring failure cases.
  • Fine-tuned open-source models are a viable alternative when schema confidentiality prevents sending DDL to third-party APIs.

Working with JusDB on AI Database Features

Building a production text-to-SQL system requires more than a working prototype — it requires a database schema that's actually queryable by an LLM. That means consistent naming conventions, documented foreign keys, column comments, and a read-replica architecture that can absorb unpredictable query loads. Most production databases weren't designed with LLM consumers in mind, and retrofitting that structure is a database engineering problem as much as an ML one.

The JusDB team works directly on PostgreSQL schema design, read-replica configuration, query optimization, and the database-side infrastructure that makes AI features reliable in production. If you're building a natural language interface to your data — or want an expert review of your existing text-to-SQL implementation — we can help with the parts that don't show up in LLM benchmarks.

Explore JusDB PostgreSQL Consulting →  |  Talk to a DBA

Share this article

Keep reading

PostgreSQL 19 Beta: Every New Feature That Matters to DBAs

PostgreSQL 19 Beta 1 (June 4, 2026) brings parallel autovacuum, the native REPACK command for online table rebuilds, 2x faster inserts under foreign-key load, online logical replication without a restart, WAIT FOR LSN for read-your-writes consistency, and default changes (JIT off, lz4 TOAST, RADIUS removed). A DBA-focused walkthrough of what changed and what to test before GA.

PostgreSQL14 minJun 15, 2026
Read

High Performance with MongoDB: A Top-Down Tuning Guide

A top-down playbook for high-performance MongoDB: measure with the profiler and explain(), model for access patterns, index by the ESR rule, keep the working set in the WiredTiger cache, pool connections, and scale reads with secondaries and sharding — with flow diagrams for each layer.

MongoDB14 minJun 6, 2026
Read

Migrate On-Premises SQL Server to Amazon RDS: Native Backup/Restore vs AWS DMS

A step-by-step guide to migrating an on-premises Microsoft SQL Server database to Amazon RDS for SQL Server — covering native backup/restore via S3 with the rds_restore_database stored procedure, AWS DMS full-load + CDC for near-zero downtime, option group and IAM setup, cutover, and post-migration hardening.

AWS15 minJun 2, 2026
Read