LLM-Generated SQL Query Evaluation Framework

A comprehensive, reusable framework for evaluating natural-language-to-SQL generation systems. This notebook combines automated unit tests, SQL execution validation, and LLM-as-a-judge semantic scoring to systematically evaluate and compare different prompting strategies.

Quick Start

This notebook is designed to be customized for your own text-to-SQL evaluation needs.

Use Your Own Dataset

Replace the dataset loading cell with your data. Minimum requirement: a question column. Optional: sql (reference SQL) and context (schema info) for additional metrics.

# Replace the dataset loading cell with:
df = pd.DataFrame({
    'question': ['Your question 1', 'Your question 2', ...],  # Required
    'sql': ['SELECT ...', 'SELECT ...', ...],                # Optional: reference SQL
    'context': ['CREATE TABLE ...', ...]                     # Optional: schema info
})

# Or load from CSV/JSON:
df = pd.read_csv('your_dataset.csv')

Change Models

In the environment setup cell, modify the EvalConfig:

eval_config = EvalConfig(
    generator_model="gpt-4",      # Your SQL generation model
    judge_model="gpt-4o-mini",    # Your evaluation model
    dataset_slice=100,            # Limit to 100 examples
    mysql_host="localhost",       # MySQL server hostname
    mysql_user="root",            # MySQL username
    mysql_password="your-password", # MySQL password
    mysql_database="test_sql_eval" # MySQL database name
)

Customize Prompts

Modify instruction_strategy_a and instruction_strategy_b in the comparison section, or create your own prompt strategies.

Run Evaluation

Execute cells sequentially. The framework handles the rest!

What Is Evaluated

JSON Structure: Validates that the LLM output conforms to the expected schema
SQL Syntax: Ensures generated SQL can be parsed and executed without errors
Semantic Relevance: Measures how well the SQL answers the original question (0-100 scale)
Query Complexity: Categorizes queries by difficulty (simple/medium/complex)
Execution Accuracy: Compares generated SQL to reference queries when available

Full Customization Guide

1. Using Your Own Dataset

The framework requires at minimum a question column. Additional columns enable more metrics:

question (required): Natural language questions to convert to SQL
sql (optional): Reference SQL queries for accuracy comparison
context (optional): Schema definitions (CREATE TABLE statements) to provide context

Example:

df = pd.DataFrame({
    'question': [
        "What is the total revenue for each product category?",
        "List all customers who have placed more than 5 orders"
    ],
    'sql': [
        "SELECT category, SUM(revenue) FROM products GROUP BY category",
        "SELECT customer_id FROM orders GROUP BY customer_id HAVING COUNT(*) > 5"
    ],
    'context': [
        "CREATE TABLE products (id INT, category TEXT, revenue FLOAT)",
        "CREATE TABLE orders (id INT, customer_id INT, order_date DATE)"
    ]
})

2. Changing Models

Edit the EvalConfig in the environment setup cell:

eval_config = EvalConfig(
    generator_model="gpt-4",           # Model for SQL generation
    judge_model="gpt-4o-mini",         # Model for semantic scoring
    dataset_slice=50,                  # Limit to 50 examples (None = all)
    semantic_score_min=0,              # Minimum semantic score
    semantic_score_max=100,            # Maximum semantic score
    enable_complexity_analysis=True,   # Analyze query complexity
    enable_execution_accuracy=True,    # Compare to reference SQL
    mysql_host="localhost",            # MySQL server hostname
    mysql_user="root",                 # MySQL username
    mysql_password="your-password",     # MySQL password
    mysql_database="test_sql_eval"     # MySQL database name
)

Supported models: Any OpenAI-compatible model (gpt-4, gpt-4o-mini, gpt-3.5-turbo, etc.)

3. Customizing Prompts

Modify the prompt strategies in the "Comparing Prompt Variants" section:

instruction_strategy_a = """Your custom prompt here.

Include instructions for:
- SQL generation requirements
- Output format (JSON with create_sql, select_sql)
- Any domain-specific constraints
"""

instruction_strategy_b = """Alternative prompt strategy for comparison."""

4. Adjusting Evaluation Settings

Control evaluation behavior through EvalConfig:

dataset_slice: Limit number of examples (useful for testing)
semantic_score_min/max: Adjust scoring scale (default 0-100)
enable_complexity_analysis: Toggle query complexity categorization
enable_execution_accuracy: Toggle comparison to reference SQL

Who Can Benefit From This?

This framework is valuable for:

Data Engineers & Analysts: Who need to evaluate and improve natural language to SQL conversion systems
ML Engineers: Building or fine-tuning LLM-based SQL generation models who need systematic evaluation metrics
Product Teams: Developing conversational database interfaces or BI tools that generate SQL from user queries
Researchers: Studying prompt engineering strategies and comparing different LLM approaches for code generation
DevOps/QA Teams: Implementing automated testing pipelines for SQL generation systems in production
Anyone building text-to-SQL systems: Who wants a reusable, extensible evaluation framework to measure and improve their system's reliability

Why Is This Framework Needed?

Even when a single example works, you still need systematic evaluation because:

One example doesn't tell the whole story: A model might work on simple queries but fail on complex ones, edge cases, or domain-specific questions
Prompt engineering requires comparison: You need to test multiple prompt strategies to find which one performs best across many examples, not just one
Production reliability: In production, you'll face thousands of queries. Without systematic testing, you won't know your failure rate or which types of queries are problematic
Model updates break things: When you update your LLM, change prompts, or switch models, you need to ensure quality hasn't degraded across your entire test suite
Cost optimization: Different prompts/models have different costs. This framework helps you find the best balance between quality and cost
Debugging at scale: When something goes wrong in production, this framework helps you identify patterns (e.g., "queries with JOINs fail 30% of the time") rather than debugging one-off issues
Confidence in deployment: Before deploying a new prompt or model to production, you need quantitative evidence that it's better than the current version

Real-world scenario: You might test 10 queries manually and they all work, but when you deploy to production with 1000+ daily queries, you discover 15% fail due to SQL syntax errors or semantic mismatches. This framework catches those issues before deployment.

Requirements

Required Packages

pip install openai pandas pydantic numpy matplotlib tqdm python-dotenv pymysql

Optional Packages

pip install datasets      # For loading Hugging Face datasets
pip install seaborn     # For enhanced plotting styles

Environment Setup

Set your OpenAI API key in a .env file:
```
OPENAI_API_KEY=your-api-key-here
```

Set your MySQL connection details in a .env file (or use defaults):

MYSQL_HOST=localhost
MYSQL_USER=root
MYSQL_PASSWORD=your-password
MYSQL_DATABASE=test_sql_eval

Or export them as environment variables:

export OPENAI_API_KEY=your-api-key-here
export MYSQL_HOST=localhost
export MYSQL_USER=root
export MYSQL_PASSWORD=your-password
export MYSQL_DATABASE=test_sql_eval

Framework Components

1. Response Schema

The framework expects LLM outputs in a structured JSON format:

{
  "create_sql": "CREATE TABLE ...",
  "select_sql": "SELECT ...",
  "explanation": "Optional explanation"
}

2. Validation Pipeline

Payload Validation: Checks JSON structure and required fields
SQL Execution: Validates SQL syntax and executes in MySQL
Semantic Scoring: Uses LLM-as-judge to score relevance (0-100)
Complexity Analysis: Categorizes queries as simple/medium/complex
Accuracy Comparison: Compares to reference SQL when available

3. Evaluation Metrics

Validation Pass Rate: Percentage of queries passing all structural checks
Semantic Score: Average relevance score (0-100)
Error Breakdown: Categorization by error type (payload, execution)
Complexity Performance: Performance by query complexity level
Length Analysis: Error patterns by question length

Usage Workflow

Setup: Configure EvalConfig with your models and settings
Load Data: Use your dataset or the provided sample
Define Prompts: Create prompt strategies to compare
Run Evaluation: Execute the evaluation suite
Analyze Results: Review visualizations and metrics
Iterate: Refine prompts based on results

Extending the Framework

Adding Custom Metrics

You can extend the evaluation pipeline by modifying execute_eval_suite():

def execute_eval_suite(df, generator_instruction, config):
    # ... existing code ...

    # Add your custom metric
    result['custom_metric'] = compute_custom_metric(parsed_output)

    return results

Supporting Other LLM Providers

Modify invoke_llm_generator() to support other providers:

def invoke_llm_generator(instruction, user_input, config):
    # Replace OpenAI client with your provider
    # e.g., Anthropic, Cohere, etc.
    pass

Custom SQL Validation

Extend run_sql_sanity_checks() to add domain-specific validation:

def run_sql_sanity_checks(parsed_response):
    # ... existing checks ...

    # Add custom validation
    if not validate_custom_rule(parsed_response.select_sql):
        return False, "Custom validation failed"

    return True, None

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
repo-header.png		repo-header.png
sql_evaluation_framework.ipynb		sql_evaluation_framework.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LLM-Generated SQL Query Evaluation Framework

Quick Start

Use Your Own Dataset

Change Models

Customize Prompts

Run Evaluation

What Is Evaluated

Full Customization Guide

1. Using Your Own Dataset

2. Changing Models

3. Customizing Prompts

4. Adjusting Evaluation Settings

Who Can Benefit From This?

Why Is This Framework Needed?

Requirements

Required Packages

Optional Packages

Environment Setup

Framework Components

1. Response Schema

2. Validation Pipeline

3. Evaluation Metrics

Usage Workflow

Extending the Framework

Adding Custom Metrics

Supporting Other LLM Providers

Custom SQL Validation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

LLM-Generated SQL Query Evaluation Framework

Quick Start

Use Your Own Dataset

Change Models

Customize Prompts

Run Evaluation

What Is Evaluated

Full Customization Guide

1. Using Your Own Dataset

2. Changing Models

3. Customizing Prompts

4. Adjusting Evaluation Settings

Who Can Benefit From This?

Why Is This Framework Needed?

Requirements

Required Packages

Optional Packages

Environment Setup

Framework Components

1. Response Schema

2. Validation Pipeline

3. Evaluation Metrics

Usage Workflow

Extending the Framework

Adding Custom Metrics

Supporting Other LLM Providers

Custom SQL Validation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages