A comprehensive, reusable framework for evaluating natural-language-to-SQL generation systems. This notebook combines automated unit tests, SQL execution validation, and LLM-as-a-judge semantic scoring to systematically evaluate and compare different prompting strategies.
This notebook is designed to be customized for your own text-to-SQL evaluation needs.
Replace the dataset loading cell with your data. Minimum requirement: a question column. Optional: sql (reference SQL) and context (schema info) for additional metrics.
# Replace the dataset loading cell with:
df = pd.DataFrame({
'question': ['Your question 1', 'Your question 2', ...], # Required
'sql': ['SELECT ...', 'SELECT ...', ...], # Optional: reference SQL
'context': ['CREATE TABLE ...', ...] # Optional: schema info
})
# Or load from CSV/JSON:
df = pd.read_csv('your_dataset.csv')In the environment setup cell, modify the EvalConfig:
eval_config = EvalConfig(
generator_model="gpt-4", # Your SQL generation model
judge_model="gpt-4o-mini", # Your evaluation model
dataset_slice=100, # Limit to 100 examples
mysql_host="localhost", # MySQL server hostname
mysql_user="root", # MySQL username
mysql_password="your-password", # MySQL password
mysql_database="test_sql_eval" # MySQL database name
)Modify instruction_strategy_a and instruction_strategy_b in the comparison section, or create your own prompt strategies.
Execute cells sequentially. The framework handles the rest!
- JSON Structure: Validates that the LLM output conforms to the expected schema
- SQL Syntax: Ensures generated SQL can be parsed and executed without errors
- Semantic Relevance: Measures how well the SQL answers the original question (0-100 scale)
- Query Complexity: Categorizes queries by difficulty (simple/medium/complex)
- Execution Accuracy: Compares generated SQL to reference queries when available
The framework requires at minimum a question column. Additional columns enable more metrics:
question(required): Natural language questions to convert to SQLsql(optional): Reference SQL queries for accuracy comparisoncontext(optional): Schema definitions (CREATE TABLE statements) to provide context
Example:
df = pd.DataFrame({
'question': [
"What is the total revenue for each product category?",
"List all customers who have placed more than 5 orders"
],
'sql': [
"SELECT category, SUM(revenue) FROM products GROUP BY category",
"SELECT customer_id FROM orders GROUP BY customer_id HAVING COUNT(*) > 5"
],
'context': [
"CREATE TABLE products (id INT, category TEXT, revenue FLOAT)",
"CREATE TABLE orders (id INT, customer_id INT, order_date DATE)"
]
})Edit the EvalConfig in the environment setup cell:
eval_config = EvalConfig(
generator_model="gpt-4", # Model for SQL generation
judge_model="gpt-4o-mini", # Model for semantic scoring
dataset_slice=50, # Limit to 50 examples (None = all)
semantic_score_min=0, # Minimum semantic score
semantic_score_max=100, # Maximum semantic score
enable_complexity_analysis=True, # Analyze query complexity
enable_execution_accuracy=True, # Compare to reference SQL
mysql_host="localhost", # MySQL server hostname
mysql_user="root", # MySQL username
mysql_password="your-password", # MySQL password
mysql_database="test_sql_eval" # MySQL database name
)Supported models: Any OpenAI-compatible model (gpt-4, gpt-4o-mini, gpt-3.5-turbo, etc.)
Modify the prompt strategies in the "Comparing Prompt Variants" section:
instruction_strategy_a = """Your custom prompt here.
Include instructions for:
- SQL generation requirements
- Output format (JSON with create_sql, select_sql)
- Any domain-specific constraints
"""
instruction_strategy_b = """Alternative prompt strategy for comparison."""Control evaluation behavior through EvalConfig:
dataset_slice: Limit number of examples (useful for testing)semantic_score_min/max: Adjust scoring scale (default 0-100)enable_complexity_analysis: Toggle query complexity categorizationenable_execution_accuracy: Toggle comparison to reference SQL
This framework is valuable for:
- Data Engineers & Analysts: Who need to evaluate and improve natural language to SQL conversion systems
- ML Engineers: Building or fine-tuning LLM-based SQL generation models who need systematic evaluation metrics
- Product Teams: Developing conversational database interfaces or BI tools that generate SQL from user queries
- Researchers: Studying prompt engineering strategies and comparing different LLM approaches for code generation
- DevOps/QA Teams: Implementing automated testing pipelines for SQL generation systems in production
- Anyone building text-to-SQL systems: Who wants a reusable, extensible evaluation framework to measure and improve their system's reliability
Even when a single example works, you still need systematic evaluation because:
- One example doesn't tell the whole story: A model might work on simple queries but fail on complex ones, edge cases, or domain-specific questions
- Prompt engineering requires comparison: You need to test multiple prompt strategies to find which one performs best across many examples, not just one
- Production reliability: In production, you'll face thousands of queries. Without systematic testing, you won't know your failure rate or which types of queries are problematic
- Model updates break things: When you update your LLM, change prompts, or switch models, you need to ensure quality hasn't degraded across your entire test suite
- Cost optimization: Different prompts/models have different costs. This framework helps you find the best balance between quality and cost
- Debugging at scale: When something goes wrong in production, this framework helps you identify patterns (e.g., "queries with JOINs fail 30% of the time") rather than debugging one-off issues
- Confidence in deployment: Before deploying a new prompt or model to production, you need quantitative evidence that it's better than the current version
Real-world scenario: You might test 10 queries manually and they all work, but when you deploy to production with 1000+ daily queries, you discover 15% fail due to SQL syntax errors or semantic mismatches. This framework catches those issues before deployment.
pip install openai pandas pydantic numpy matplotlib tqdm python-dotenv pymysqlpip install datasets # For loading Hugging Face datasets
pip install seaborn # For enhanced plotting styles-
Set your OpenAI API key in a
.envfile:OPENAI_API_KEY=your-api-key-here -
Set your MySQL connection details in a
.envfile (or use defaults):MYSQL_HOST=localhost MYSQL_USER=root MYSQL_PASSWORD=your-password MYSQL_DATABASE=test_sql_eval -
Or export them as environment variables:
export OPENAI_API_KEY=your-api-key-here export MYSQL_HOST=localhost export MYSQL_USER=root export MYSQL_PASSWORD=your-password export MYSQL_DATABASE=test_sql_eval
The framework expects LLM outputs in a structured JSON format:
{
"create_sql": "CREATE TABLE ...",
"select_sql": "SELECT ...",
"explanation": "Optional explanation"
}- Payload Validation: Checks JSON structure and required fields
- SQL Execution: Validates SQL syntax and executes in MySQL
- Semantic Scoring: Uses LLM-as-judge to score relevance (0-100)
- Complexity Analysis: Categorizes queries as simple/medium/complex
- Accuracy Comparison: Compares to reference SQL when available
- Validation Pass Rate: Percentage of queries passing all structural checks
- Semantic Score: Average relevance score (0-100)
- Error Breakdown: Categorization by error type (payload, execution)
- Complexity Performance: Performance by query complexity level
- Length Analysis: Error patterns by question length
- Setup: Configure
EvalConfigwith your models and settings - Load Data: Use your dataset or the provided sample
- Define Prompts: Create prompt strategies to compare
- Run Evaluation: Execute the evaluation suite
- Analyze Results: Review visualizations and metrics
- Iterate: Refine prompts based on results
You can extend the evaluation pipeline by modifying execute_eval_suite():
def execute_eval_suite(df, generator_instruction, config):
# ... existing code ...
# Add your custom metric
result['custom_metric'] = compute_custom_metric(parsed_output)
return resultsModify invoke_llm_generator() to support other providers:
def invoke_llm_generator(instruction, user_input, config):
# Replace OpenAI client with your provider
# e.g., Anthropic, Cohere, etc.
passExtend run_sql_sanity_checks() to add domain-specific validation:
def run_sql_sanity_checks(parsed_response):
# ... existing checks ...
# Add custom validation
if not validate_custom_rule(parsed_response.select_sql):
return False, "Custom validation failed"
return True, None