Skip to content

rainbowgore/llm-mysql-eval

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

LLM-Generated SQL Query Evaluation Framework

Repository Header

Open In Colab

A comprehensive, reusable framework for evaluating natural-language-to-SQL generation systems. This notebook combines automated unit tests, SQL execution validation, and LLM-as-a-judge semantic scoring to systematically evaluate and compare different prompting strategies.

Quick Start

This notebook is designed to be customized for your own text-to-SQL evaluation needs.

Use Your Own Dataset

Replace the dataset loading cell with your data. Minimum requirement: a question column. Optional: sql (reference SQL) and context (schema info) for additional metrics.

# Replace the dataset loading cell with:
df = pd.DataFrame({
    'question': ['Your question 1', 'Your question 2', ...],  # Required
    'sql': ['SELECT ...', 'SELECT ...', ...],                # Optional: reference SQL
    'context': ['CREATE TABLE ...', ...]                     # Optional: schema info
})

# Or load from CSV/JSON:
df = pd.read_csv('your_dataset.csv')

Change Models

In the environment setup cell, modify the EvalConfig:

eval_config = EvalConfig(
    generator_model="gpt-4",      # Your SQL generation model
    judge_model="gpt-4o-mini",    # Your evaluation model
    dataset_slice=100,            # Limit to 100 examples
    mysql_host="localhost",       # MySQL server hostname
    mysql_user="root",            # MySQL username
    mysql_password="your-password", # MySQL password
    mysql_database="test_sql_eval" # MySQL database name
)

Customize Prompts

Modify instruction_strategy_a and instruction_strategy_b in the comparison section, or create your own prompt strategies.

Run Evaluation

Execute cells sequentially. The framework handles the rest!

What Is Evaluated

  1. JSON Structure: Validates that the LLM output conforms to the expected schema
  2. SQL Syntax: Ensures generated SQL can be parsed and executed without errors
  3. Semantic Relevance: Measures how well the SQL answers the original question (0-100 scale)
  4. Query Complexity: Categorizes queries by difficulty (simple/medium/complex)
  5. Execution Accuracy: Compares generated SQL to reference queries when available

Full Customization Guide

1. Using Your Own Dataset

The framework requires at minimum a question column. Additional columns enable more metrics:

  • question (required): Natural language questions to convert to SQL
  • sql (optional): Reference SQL queries for accuracy comparison
  • context (optional): Schema definitions (CREATE TABLE statements) to provide context

Example:

df = pd.DataFrame({
    'question': [
        "What is the total revenue for each product category?",
        "List all customers who have placed more than 5 orders"
    ],
    'sql': [
        "SELECT category, SUM(revenue) FROM products GROUP BY category",
        "SELECT customer_id FROM orders GROUP BY customer_id HAVING COUNT(*) > 5"
    ],
    'context': [
        "CREATE TABLE products (id INT, category TEXT, revenue FLOAT)",
        "CREATE TABLE orders (id INT, customer_id INT, order_date DATE)"
    ]
})

2. Changing Models

Edit the EvalConfig in the environment setup cell:

eval_config = EvalConfig(
    generator_model="gpt-4",           # Model for SQL generation
    judge_model="gpt-4o-mini",         # Model for semantic scoring
    dataset_slice=50,                  # Limit to 50 examples (None = all)
    semantic_score_min=0,              # Minimum semantic score
    semantic_score_max=100,            # Maximum semantic score
    enable_complexity_analysis=True,   # Analyze query complexity
    enable_execution_accuracy=True,    # Compare to reference SQL
    mysql_host="localhost",            # MySQL server hostname
    mysql_user="root",                 # MySQL username
    mysql_password="your-password",     # MySQL password
    mysql_database="test_sql_eval"     # MySQL database name
)

Supported models: Any OpenAI-compatible model (gpt-4, gpt-4o-mini, gpt-3.5-turbo, etc.)

3. Customizing Prompts

Modify the prompt strategies in the "Comparing Prompt Variants" section:

instruction_strategy_a = """Your custom prompt here.

Include instructions for:
- SQL generation requirements
- Output format (JSON with create_sql, select_sql)
- Any domain-specific constraints
"""

instruction_strategy_b = """Alternative prompt strategy for comparison."""

4. Adjusting Evaluation Settings

Control evaluation behavior through EvalConfig:

  • dataset_slice: Limit number of examples (useful for testing)
  • semantic_score_min/max: Adjust scoring scale (default 0-100)
  • enable_complexity_analysis: Toggle query complexity categorization
  • enable_execution_accuracy: Toggle comparison to reference SQL

Who Can Benefit From This?

This framework is valuable for:

  1. Data Engineers & Analysts: Who need to evaluate and improve natural language to SQL conversion systems
  2. ML Engineers: Building or fine-tuning LLM-based SQL generation models who need systematic evaluation metrics
  3. Product Teams: Developing conversational database interfaces or BI tools that generate SQL from user queries
  4. Researchers: Studying prompt engineering strategies and comparing different LLM approaches for code generation
  5. DevOps/QA Teams: Implementing automated testing pipelines for SQL generation systems in production
  6. Anyone building text-to-SQL systems: Who wants a reusable, extensible evaluation framework to measure and improve their system's reliability

Why Is This Framework Needed?

Even when a single example works, you still need systematic evaluation because:

  1. One example doesn't tell the whole story: A model might work on simple queries but fail on complex ones, edge cases, or domain-specific questions
  2. Prompt engineering requires comparison: You need to test multiple prompt strategies to find which one performs best across many examples, not just one
  3. Production reliability: In production, you'll face thousands of queries. Without systematic testing, you won't know your failure rate or which types of queries are problematic
  4. Model updates break things: When you update your LLM, change prompts, or switch models, you need to ensure quality hasn't degraded across your entire test suite
  5. Cost optimization: Different prompts/models have different costs. This framework helps you find the best balance between quality and cost
  6. Debugging at scale: When something goes wrong in production, this framework helps you identify patterns (e.g., "queries with JOINs fail 30% of the time") rather than debugging one-off issues
  7. Confidence in deployment: Before deploying a new prompt or model to production, you need quantitative evidence that it's better than the current version

Real-world scenario: You might test 10 queries manually and they all work, but when you deploy to production with 1000+ daily queries, you discover 15% fail due to SQL syntax errors or semantic mismatches. This framework catches those issues before deployment.

Requirements

Required Packages

pip install openai pandas pydantic numpy matplotlib tqdm python-dotenv pymysql

Optional Packages

pip install datasets      # For loading Hugging Face datasets
pip install seaborn     # For enhanced plotting styles

Environment Setup

  1. Set your OpenAI API key in a .env file:

    OPENAI_API_KEY=your-api-key-here
    
  2. Set your MySQL connection details in a .env file (or use defaults):

    MYSQL_HOST=localhost
    MYSQL_USER=root
    MYSQL_PASSWORD=your-password
    MYSQL_DATABASE=test_sql_eval
    
  3. Or export them as environment variables:

    export OPENAI_API_KEY=your-api-key-here
    export MYSQL_HOST=localhost
    export MYSQL_USER=root
    export MYSQL_PASSWORD=your-password
    export MYSQL_DATABASE=test_sql_eval

Framework Components

1. Response Schema

The framework expects LLM outputs in a structured JSON format:

{
  "create_sql": "CREATE TABLE ...",
  "select_sql": "SELECT ...",
  "explanation": "Optional explanation"
}

2. Validation Pipeline

  1. Payload Validation: Checks JSON structure and required fields
  2. SQL Execution: Validates SQL syntax and executes in MySQL
  3. Semantic Scoring: Uses LLM-as-judge to score relevance (0-100)
  4. Complexity Analysis: Categorizes queries as simple/medium/complex
  5. Accuracy Comparison: Compares to reference SQL when available

3. Evaluation Metrics

  • Validation Pass Rate: Percentage of queries passing all structural checks
  • Semantic Score: Average relevance score (0-100)
  • Error Breakdown: Categorization by error type (payload, execution)
  • Complexity Performance: Performance by query complexity level
  • Length Analysis: Error patterns by question length

Usage Workflow

  1. Setup: Configure EvalConfig with your models and settings
  2. Load Data: Use your dataset or the provided sample
  3. Define Prompts: Create prompt strategies to compare
  4. Run Evaluation: Execute the evaluation suite
  5. Analyze Results: Review visualizations and metrics
  6. Iterate: Refine prompts based on results

Extending the Framework

Adding Custom Metrics

You can extend the evaluation pipeline by modifying execute_eval_suite():

def execute_eval_suite(df, generator_instruction, config):
    # ... existing code ...

    # Add your custom metric
    result['custom_metric'] = compute_custom_metric(parsed_output)

    return results

Supporting Other LLM Providers

Modify invoke_llm_generator() to support other providers:

def invoke_llm_generator(instruction, user_input, config):
    # Replace OpenAI client with your provider
    # e.g., Anthropic, Cohere, etc.
    pass

Custom SQL Validation

Extend run_sql_sanity_checks() to add domain-specific validation:

def run_sql_sanity_checks(parsed_response):
    # ... existing checks ...

    # Add custom validation
    if not validate_custom_rule(parsed_response.select_sql):
        return False, "Custom validation failed"

    return True, None

About

Systematically evaluate text-to-SQL systems for MySQL with automated validation and LLM-based semantic scoring.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors