NL2SQL is an enterprise-grade, agentic Natural Language to SQL pipeline built on LangGraph.
Unlike simple prompt-to-query solutions, this project employs a multi-agent supervisor architecture to handle complex, real-world database interactions. It features dynamic routing across multiple database engines (Postgres, MySQL, MSSQL, SQLite), vector-based schema retrieval for scalability, and a rigorous Plan-Validate-Execute loop to ensure query correctness and safety.
Designed for observability and reliability, it provides detailed performance metrics, step-by-step reasoning logs, and deterministic SQL generation, making it suitable for production environments where accuracy is paramount.
- Supervisor Architecture: Dynamic routing of queries to the appropriate database using a 3-layer routing strategy (Vector, Multi-Query, LLM).
- Cross-Database Support: Map-Reduce architecture to handle complex queries spanning multiple databases (e.g., "Compare sales from MSSQL and inventory from MySQL").
- LangGraph Pipeline: Modular, stateful graph architecture with specialized agents for Intent, Planning, and Validation.
- Multi-Database Support: Seamlessly query across Postgres, MySQL, MSSQL, and SQLite.
- Rule-Based SQL Generation: Deterministic, token-efficient SQL generation using `sqlglot` to prevent syntax hallucinations.
- Robust Validation Loop: Pre-execution validation of plans against the schema to catch errors early.
- Observability: Real-time streaming of agent reasoning steps and structured JSON logging.
- Vector Search (RAG): Scalable schema retrieval for large databases.
Follow this guide to set up the environment and run your first query.
- Python 3.10+
- Docker & Docker Compose
- OpenAI API Key
```bash
git clone https://github.com/nadeem4/nl2sql.git
cd nl2sql
pip install -r requirements.txt
```

Start the database containers (Postgres, MySQL, MSSQL) and seed them with synthetic manufacturing data:
```bash
# Start containers
docker-compose up -d

# Seed data (wait ~10s for DBs to initialize)
python scripts/seed_databases.py --wait
```
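If seeding fails because a container has not finished initializing, a quick readiness probe can help. The sketch below is standalone and not part of the repository; the connection string is an assumption matching the `configs/datasources.yaml` example later in this README.

```python
# readiness_check.py -- illustrative sketch, not part of the repo.
# The URL is an assumption; adjust it to your docker-compose setup.
import time

from sqlalchemy import create_engine, text

URLS = [
    "postgresql+psycopg2://user:password@localhost:5432/manufacturing_ops",
    # Add the MySQL/MSSQL URLs from configs/datasources.yaml as needed.
]

def wait_until_ready(url: str, timeout: float = 60.0) -> None:
    """Poll the database until SELECT 1 succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    engine = create_engine(url)
    while True:
        try:
            with engine.connect() as conn:
                conn.execute(text("SELECT 1"))
            print(f"ready: {url}")
            return
        except Exception as exc:
            if time.monotonic() > deadline:
                raise TimeoutError(f"{url} not ready: {exc}") from exc
            time.sleep(2)

if __name__ == "__main__":
    for url in URLS:
        wait_until_ready(url)
```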
Create a `.env` file in the root directory:

```
OPENAI_API_KEY="sk-..."
```

Run a query against the Postgres database to verify the pipeline:
```bash
python -m src.nl2sql.cli --id manufacturing_ops --query "List 5 machines"
```

Expected Output: A structured log of the AI's reasoning followed by a table of 5 machines.
Define connection strings and settings for each database.
```yaml
manufacturing_sqlite:
  engine: sqlite
  connection_string: sqlite:///data/manufacturing.db
  tables: []  # Empty list = include all

manufacturing_ops:
  engine: postgres
  connection_string: postgresql+psycopg2://user:password@localhost:5432/manufacturing_ops
```

Map specific agents to different LLM providers/models.
```yaml
default:
  provider: openai
  model: gpt-4o

agents:
  planner:
    provider: openai
    model: gpt-4o-mini  # Use a cheaper model for planning
```
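For intuition, per-agent resolution with a default fallback can be as simple as a dictionary lookup. The sketch below is hypothetical; the repository's actual `llm_registry` module may differ.

```python
# Hypothetical sketch of per-agent model resolution with a default
# fallback; the real llm_registry module may differ.
from dataclasses import dataclass

@dataclass(frozen=True)
class LLMConfig:
    provider: str
    model: str

def resolve_llm(config: dict, agent: str) -> LLMConfig:
    """Return the agent-specific entry if present, else the default."""
    entry = config.get("agents", {}).get(agent, config["default"])
    return LLMConfig(provider=entry["provider"], model=entry["model"])

config = {
    "default": {"provider": "openai", "model": "gpt-4o"},
    "agents": {"planner": {"provider": "openai", "model": "gpt-4o-mini"}},
}
assert resolve_llm(config, "planner").model == "gpt-4o-mini"  # agent override
assert resolve_llm(config, "intent").model == "gpt-4o"        # default fallback
```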
Map datasource IDs to lists of sample questions to improve routing accuracy (Layer 1).

```yaml
manufacturing_ops:
  - "List all machines"
  - "Which machines are offline?"

manufacturing_supply:
  - "Show top products by price"
  - "List all products"
```

The CLI (`src.nl2sql.cli`) is the main entry point. It uses the Router Node to automatically select the correct datasource.
- `--query "..."`: The natural language question.
- `--show-thoughts`: Display step-by-step AI reasoning.
- `--show-perf`: Display detailed performance metrics (latency, tokens).
- `--vector-store <PATH>`: Use vector search for schema selection (requires indexing).
- `--id <ID>`: Optional. Force a specific datasource, bypassing the router (e.g., `manufacturing_ops`).
- `--no-exec`: Generate and validate SQL without executing it.
- `--json-logs`: Enable structured JSON logging.
- `--debug`: Enable debug logging for verbose output.
- `--visualize`: Visualize the execution trace and save it as a PNG image.
The system simulates a manufacturing enterprise distributed across 4 databases:
| ID | Engine | Content | Example Query |
|---|---|---|---|
| `manufacturing_ops` | Postgres | Machines, Maintenance, Employees | List 5 machines with their serial numbers |
| `manufacturing_supply` | MySQL | Products, Inventory, Suppliers | Show me top 3 products by price |
| `manufacturing_history` | MSSQL | Production Runs, Sales, Defects | Count total production runs |
| `manufacturing_ref` | SQLite | Factories, Shifts, Machine Types | List all factories and their locations |
Run these commands to test each database:
```bash
python -m src.nl2sql.cli --id manufacturing_ops --query "List 5 machines with their serial numbers"
python -m src.nl2sql.cli --id manufacturing_supply --query "Show me top 3 products by price"
python -m src.nl2sql.cli --id manufacturing_history --query "Count total production runs"
python -m src.nl2sql.cli --id manufacturing_ref --query "List all factories and their locations"
```

The system can automatically decompose complex queries into sub-queries, execute them in parallel, and aggregate the results.
```bash
python -m src.nl2sql.cli --query "Compare sales from manufacturing_history and inventory from manufacturing_supply"
python -m src.nl2sql.cli --query "Show production runs from manufacturing_history and maintenance logs from manufacturing_ops for 'Machine-1'"
python -m src.nl2sql.cli --query "List all employees from manufacturing_ops and their factory locations from manufacturing_ref"
```

Expected Output: A consolidated answer merging data from the respective databases.
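Conceptually, the Decomposer fans sub-queries out to per-datasource pipelines and the Aggregator merges the results. The sketch below illustrates that map-reduce shape only; `run_subquery` and `aggregate` are hypothetical stand-ins, not the project's API.

```python
# Conceptual map-reduce sketch; run_subquery and aggregate are hypothetical
# stand-ins for the pipeline's per-datasource execution and the Aggregator.
from concurrent.futures import ThreadPoolExecutor

def run_subquery(task: dict) -> dict:
    # In the real pipeline this would route, plan, validate, and execute SQL.
    return {"datasource": task["datasource"], "rows": []}

def aggregate(results: list[dict]) -> str:
    # In the real pipeline an LLM merges per-datasource results into one answer.
    return f"Merged results from {[r['datasource'] for r in results]}"

subqueries = [
    {"datasource": "manufacturing_history", "question": "total sales"},
    {"datasource": "manufacturing_supply", "question": "current inventory"},
]

# Fan out the sub-queries in parallel, then fan in for aggregation.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(run_subquery, subqueries))
print(aggregate(results))
```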
For large schemas, use vector search to dynamically select relevant tables.
- Index the Schema:

  ```bash
  python -m src.nl2sql.cli --index --vector-store ./chroma_db
  ```

- Query with Context:

  ```bash
  python -m src.nl2sql.cli --query "Show top 5 products" --vector-store ./chroma_db
  ```
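Under the hood, indexing presumably stores one embedded document per table so that query time retrieves only the relevant tables. A minimal sketch of that pattern with `chromadb` (the collection name and documents are invented for illustration):

```python
# Minimal sketch of vector-based schema retrieval with chromadb;
# the project's actual indexing format may differ.
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("schema_tables")

# Index one document per table: name, columns, and any comments.
collection.add(
    ids=["products", "machines"],
    documents=[
        "table products: id, name, price, supplier_id -- product catalog",
        "table machines: id, serial_number, status -- shop-floor machines",
    ],
    metadatas=[
        {"datasource": "manufacturing_supply"},
        {"datasource": "manufacturing_ops"},
    ],
)

# At query time, retrieve only the most relevant tables for the question.
hits = collection.query(query_texts=["Show top 5 products"], n_results=1)
print(hits["documents"][0])  # -> the products table description
```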
- Stream Reasoning: Use `--show-thoughts` to see the Intent, Planner, and Generator steps.
- JSON Logs: Use `--json-logs` for structured output suitable for log ingestion.
- Debug Mode: Use `--debug` for verbose output.
- Graph Visualization: Use `--visualize` to generate a visual graph of the execution trace.
We use a Supervisor Architecture where the graph itself manages routing and state.
```mermaid
flowchart TD
    user["User Query"] --> decomposer["Decomposer (AI)"]
    decomposer -- Single DB --> router["Router (AI)"]
    decomposer -- Multi DB --> split["Parallel Execution"]

    subgraph Execution Pipeline
        router --> intent["Intention (AI)"]
        intent --> schema["Schema (non-AI)"]
        schema --> planning["Planning Subgraph"]

        subgraph Planning Subgraph
            planner["Planner (AI)"] --> validator["Validator (non-AI)"]
            validator -- Invalid --> summarizer["Summarizer (AI)"]
            summarizer --> planner
        end

        validator -- Valid --> generator["SQL Generator (non-AI)"]
        generator --> executor["Executor (non-AI)"]
    end

    split --> router
    executor --> aggregator["Aggregator (AI)"]
    aggregator --> answer["Final Answer"]

    style user fill:#f6f8fa,stroke:#aaa
    style decomposer fill:#e1f5fe,stroke:#01579b
    style router fill:#e1f5fe,stroke:#01579b
    style aggregator fill:#e1f5fe,stroke:#01579b
    style answer fill:#f6f8fa,stroke:#aaa
```
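To make the diagram concrete, here is a minimal sketch of wiring a plan-validate loop with LangGraph's `StateGraph`; the node bodies are placeholder stand-ins, not the project's actual agents.

```python
# Minimal LangGraph wiring sketch; node bodies are placeholders for the
# project's agents, shown only to illustrate the graph shape above.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class State(TypedDict):
    query: str
    plan: dict
    valid: bool
    sql: str

def planner(state: State) -> dict:
    return {"plan": {"tables": ["machines"]}}  # stand-in for the AI planner

def validator(state: State) -> dict:
    return {"valid": True}  # stand-in for schema validation

def generator(state: State) -> dict:
    return {"sql": "SELECT * FROM machines LIMIT 5"}  # stand-in for sqlglot

builder = StateGraph(State)
builder.add_node("planner", planner)
builder.add_node("validator", validator)
builder.add_node("generator", generator)
builder.add_edge(START, "planner")
builder.add_edge("planner", "validator")
# Loop back to the planner on invalid plans, as in the diagram above.
builder.add_conditional_edges(
    "validator", lambda s: "generator" if s["valid"] else "planner"
)
builder.add_edge("generator", END)

graph = builder.compile()
print(graph.invoke({"query": "List 5 machines"})["sql"])
```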
To support efficient querying across large or multiple databases, we use a two-tiered vectorization approach:
- Datasource Routing:
  - What: Indexes the `description` of each database.
  - Why: Determines which database contains the relevant data (e.g., "Sales" vs. "Inventory").
  - Strategy:
    - Layer 1 (Fast): Vector search against database descriptions and 200+ sample questions.
    - Layer 2 (Robust): If confidence is low (distance > 0.4), an LLM generates 3 query variations and votes on the best datasource.
    - Layer 3 (Reasoning): If Layer 2 fails or remains uncertain, a dedicated LLM agent analyzes the schema definitions to make a final decision.
- Schema Selection:
  - What: Indexes table metadata (columns, foreign keys, comments).
  - Why: Determines which tables are needed for the query within the selected database.
This allows the system to scale to hundreds of tables without overwhelming the LLM's context window.
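The three layers form a cheap-first cascade: trust a close vector hit, otherwise escalate to LLM calls. The sketch below mirrors that control flow with stubbed helpers; only the 0.4 distance threshold and the 3 variations come from the description above, everything else is illustrative.

```python
# Schematic of the 3-layer routing cascade. vector_search, variation
# generation, and the Layer-3 agent are stubs; only the control flow
# mirrors the strategy described above.
DISTANCE_THRESHOLD = 0.4  # Layer 1 is trusted only below this distance

def vector_search(question: str) -> tuple[str, float]:
    return ("manufacturing_ops", 0.55)  # stub: (datasource, distance)

def generate_query_variations(question: str, n: int) -> list[str]:
    return [f"{question} (variant {i})" for i in range(n)]  # stub LLM call

def llm_route_over_schemas(question: str) -> str:
    return "manufacturing_ops"  # stub for the Layer-3 reasoning agent

def route(question: str) -> str:
    # Layer 1: nearest datasource by vector distance over descriptions
    # and sample questions (fast, near-zero tokens).
    datasource, distance = vector_search(question)
    if distance <= DISTANCE_THRESHOLD:
        return datasource

    # Layer 2: rephrase the question and vote across the variants.
    votes = [vector_search(v)[0] for v in generate_query_variations(question, n=3)]
    winner = max(set(votes), key=votes.count)
    if votes.count(winner) >= 2:  # simple majority wins
        return winner

    # Layer 3: fall back to an LLM agent reasoning over schema definitions.
    return llm_route_over_schemas(question)

print(route("Which machines are offline?"))
```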
- Intent (AI): Classifies query type and extracts entities.
- Planner (AI): Generates a database-agnostic structured plan (tables, joins, filters).
- Validator (Code): Verifies the plan against the schema (column existence, types).
- SQL Generator (Code): Deterministically compiles the plan to SQL using `sqlglot` (0 tokens), as illustrated below.
- Executor (Code): Runs the SQL (read-only) and returns results.
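To illustrate the zero-token generation step: once a plan is validated, it can be compiled with sqlglot's expression builder and rendered for any target dialect. The plan shape below is an assumption for illustration, not the project's actual plan schema.

```python
# Simplified sketch of compiling a validated plan to SQL with sqlglot;
# the plan dict is an invented shape for illustration.
import sqlglot

plan = {
    "columns": ["m.name", "m.serial_number"],
    "table": "machines AS m",
    "where": "m.status = 'active'",
    "limit": 5,
}

query = (
    sqlglot.select(*plan["columns"])
    .from_(plan["table"])
    .where(plan["where"])
    .limit(plan["limit"])
)

# The same expression tree renders into any supported dialect.
print(query.sql(dialect="postgres"))
print(query.sql(dialect="tsql"))  # MSSQL
```

Because the SQL is compiled from a structured plan rather than generated token by token, the output is deterministic and syntax errors cannot be hallucinated.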
The CLI provides a detailed breakdown of latency and token usage, with a top-level summary table and per-datasource details:
```text
Performance & Metrics
╭──────────────────────────────────────────────────────────────────────────────────────────╮
│ Top Level Performance                                                                    │
│ ┏━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓ │
│ ┃ Metric      ┃ Decomposer ┃ Aggregator ┃ Exec (manufacturing_ops) ┃ Total             ┃ │
│ ┡━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩ │
│ │ Latency (s) │ 0.0000     │ 0.0000     │ 7.2700                   │ 7.2700            │ │
│ │ Token Usage │ 0          │ 0          │ 3029                     │ 3029              │ │
│ └─────────────┴────────────┴────────────┴──────────────────────────┴───────────────────┘ │
│                                                                                          │
│ Performance: manufacturing_ops                                                           │
│ ┏━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┓                            │
│ ┃ Node      ┃ Type     ┃ Model       ┃ Latency (s) ┃ Tokens ┃                            │
│ ┡━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━┩                            │
│ │ Intent    │ AI       │ gpt-4o-mini │ 1.9300      │ 569    │                            │
│ │ Planner   │ AI       │ gpt-4o-mini │ 4.9900      │ 2456   │                            │
│ │ Generator │ Non-AI   │ -           │ 0.0000      │ -      │                            │
│ │ Executor  │ Non-AI   │ -           │ 0.3500      │ -      │                            │
│ └───────────┴──────────┴─────────────┴─────────────┴────────┘                            │
╰──────────────────────────────────────────────────────────────────────────────────────────╯
```
- `src/`: Core modules (`nodes`, `langgraph_pipeline`, `datasource_config`, `llm_registry`).
- `configs/`: YAML configurations for datasources and LLMs.
- `scripts/`: Utilities (e.g., `seed_databases.py`).
- `tests/`: Unit and integration tests.
We employ a rigorous evaluation framework to ensure the pipeline's reliability across different complexity levels.
A curated suite of 20 test cases designed to stress-test the Router and SQL Generation.
| Tier | Complexity | Goal | Target Layer |
|---|---|---|---|
| Easy | Keyword Match | Test speed & exact retrieval | Layer 1 (Vector) |
| Medium | Ambiguous/Slang | Test enrichment & multi-query | Layer 2 (Multi-Query) |
| Hard | Reasoning Required | Test intent understanding | Layer 3 (LLM) |
The benchmark tool captures:
- Routing Accuracy: % of queries routed to the correct datasource.
- Execution Accuracy: % of generated SQL returning the correct rows (compared to ground truth).
- Latency: End-to-end time (aiming for < 1s for L1, < 3s for L2).
- Token Usage: Cost efficiency (aiming for 0 tokens for L1 cache hits).
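Both accuracy metrics are plain ratios over the test cases. A hypothetical sketch of the computation follows; the record fields are invented and may not match the benchmark tool's actual output.

```python
# Hypothetical metric computation; record fields are invented for
# illustration and may not match the benchmark tool's output format.
records = [
    {"expected_ds": "manufacturing_ops", "routed_ds": "manufacturing_ops",
     "rows_match_ground_truth": True},
    {"expected_ds": "manufacturing_ref", "routed_ds": "manufacturing_ops",
     "rows_match_ground_truth": False},
]

routing_accuracy = sum(
    r["routed_ds"] == r["expected_ds"] for r in records
) / len(records)
execution_accuracy = sum(
    r["rows_match_ground_truth"] for r in records
) / len(records)

print(f"Routing accuracy:   {routing_accuracy:.0%}")    # 50%
print(f"Execution accuracy: {execution_accuracy:.0%}")  # 50%
```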
A. Routing-Only Evaluation: Test just the router's ability to pick the right DB.

```bash
python -m src.nl2sql.cli --benchmark --dataset tests/golden_dataset.yaml --routing-only
```

B. End-to-End Evaluation: Test routing + SQL generation + execution correctness.

```bash
python -m src.nl2sql.cli --benchmark --dataset tests/golden_dataset.yaml
```

C. Regression Testing: Use the stub LLM to test the harness itself without API costs.

```bash
python -m src.nl2sql.cli --benchmark --dataset tests/golden_dataset.yaml --stub-llm
```

Run the test suite using pytest:
```bash
python -m pytest tests/
```

To add a new database:

- Add the driver to `requirements.txt`.
- Add a connection profile to `configs/datasources.yaml`.
- (Optional) Add specific DDL/seeding logic to `scripts/seed_databases.py`.