sql-eval 🎯

Open-source evaluation framework for Text-to-SQL systems

Measure how good your Text-to-SQL actually is. Compare LLMs. Find failure patterns.

Why sql-eval?

Most Text-to-SQL demos show cherry-picked examples. sql-eval tells you the truth:

Exact accuracy — Does the generated SQL match the expected query?
Structural accuracy — Are the tables, columns, and JOINs correct?
Execution accuracy — Do both queries return the same results?
Failure analysis — Where exactly does your system break?

Features

✅ 100% Local Option — Your schema and data never leave your machine (with Ollama)
✅ Multiple LLMs — OpenAI, Anthropic Claude, Ollama, SQLCoder
✅ Bundled Datasets — Start testing in 30 seconds
✅ Custom Datasets — Bring your own schema and questions
✅ Detailed Metrics — Exact match, structural match, execution match
✅ Failure Analysis — Know exactly where your system breaks

Quick Start

Installation

pip install sql-eval

# With OpenAI support
pip install sql-eval[openai]

# With all providers
pip install sql-eval[all]

Basic Usage

# Run with bundled e-commerce dataset
sql-eval run --dataset ecommerce --llm openai

# Use Claude
sql-eval run --dataset ecommerce --llm anthropic

# Fully offline with Ollama (no data leaves your machine)
sql-eval run --dataset ecommerce --llm ollama --model codellama

# Compare multiple LLMs
sql-eval compare --dataset ecommerce --llms openai,anthropic,ollama

# List available datasets
sql-eval list

Python API

from sql_eval import Evaluator
from sql_eval.llm_providers import OpenAIProvider
from sql_eval.datasets import load_ecommerce

# Load dataset
test_cases, schema, db = load_ecommerce(with_db=True)

# Initialize evaluator
provider = OpenAIProvider()  # Uses OPENAI_API_KEY env var
evaluator = Evaluator(
    llm_provider=provider,
    schema=schema,
    db_connector=db  # Optional: for execution tests
)

# Run evaluation
report = evaluator.evaluate(test_cases)

# View results
print(report.get_summary())
print(f"Exact Match: {report.exact_match_accuracy:.1%}")
print(f"Structural Match: {report.structural_accuracy:.1%}")

# Analyze failures
for pattern in report.common_failure_patterns[:5]:
    print(f"- {pattern.pattern_name}: {pattern.count} cases")

Privacy & Security

Your data stays on your machine.

Schema-only mode: No actual data needed, just table definitions
Local execution: Use Ollama for 100% offline evaluation
No telemetry: We don't collect any usage data

# Fully offline evaluation
ollama pull codellama
sql-eval run --dataset ecommerce --llm ollama --model codellama

Supported LLM Providers

Provider	Models	Local?	Setup
OpenAI	GPT-4o, GPT-4o-mini	❌	`export OPENAI_API_KEY=...`
Anthropic	Claude Sonnet, Claude Haiku	❌	`export ANTHROPIC_API_KEY=...`
Ollama	CodeLlama, Llama3, Mistral	✅	`ollama serve`
SQLCoder	sqlcoder	✅	`ollama pull sqlcoder`

Bundled Datasets

Dataset	Questions	Tables	Difficulty
ecommerce	35	5	Easy to Hard

More datasets coming soon (SaaS metrics, HR database).

Custom Datasets

From CSV

from sql_eval.core.evaluator import DatasetLoader

test_cases = DatasetLoader.from_csv(
    'my_questions.csv',
    question_col='question',
    sql_col='sql',
    difficulty_col='difficulty'
)

From JSON

test_cases = DatasetLoader.from_json('my_questions.json')

Expected JSON format:

[
  {
    "question_id": "Q001",
    "question": "How many users signed up last month?",
    "sql": "SELECT COUNT(*) FROM users WHERE created_at >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')",
    "difficulty": "medium",
    "category": "date_functions"
  }
]

Custom Schema

from sql_eval.core.schema_loader import SchemaLoader

# From SQL file
schema = SchemaLoader.from_sql_file('my_schema.sql')

# From DDL string
schema = SchemaLoader.from_ddl("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        email VARCHAR(255) NOT NULL,
        created_at TIMESTAMP
    );
""")

Metrics Explained

Exact Match

Generated SQL matches expected SQL after normalization (whitespace, case).

Structural Match

Compares SQL components independently:

Tables referenced
Columns selected
JOIN conditions
WHERE clauses
GROUP BY / ORDER BY
Aggregation functions

Execution Match

Both queries return identical results when run against the database.

Output Formats

# Console (default)
sql-eval run --dataset ecommerce --llm openai

# JSON report
sql-eval run --dataset ecommerce --llm openai --output json

# HTML report
sql-eval run --dataset ecommerce --llm openai --output html

Contributing

Contributions welcome! Areas where help is needed:

More bundled datasets (SaaS metrics, HR database, etc.)
Additional LLM providers (Google Gemini, Cohere, etc.)
Improved SQL comparison logic
Documentation improvements
More test coverage

Development Setup

# Clone the repo
git clone https://github.com/ajaymaurya/sql-eval.git
cd sql-eval

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e .[dev]

# Run tests
pytest tests/ -v

# Run linting
ruff check .

Pull Request Process

Fork the repo and create your branch from main
Add tests for any new functionality
Ensure all tests pass (pytest tests/ -v)
Update documentation if needed
Submit a PR with a clear description

License

MIT License - see LICENSE for details.

Author

Ajay Maurya - AI Engineer LinkedIn | GitHub

Built with ❤️ for the Text-to-SQL community

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
.github/workflows		.github/workflows
examples		examples
sql_eval		sql_eval
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

sql-eval 🎯

Why sql-eval?

Features

Quick Start

Installation

Basic Usage

Python API

Privacy & Security

Supported LLM Providers

Bundled Datasets

Custom Datasets

From CSV

From JSON

Custom Schema

Metrics Explained

Exact Match

Structural Match

Execution Match

Output Formats

Contributing

Development Setup

Pull Request Process

License

Author

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

sql-eval 🎯

Why sql-eval?

Features

Quick Start

Installation

Basic Usage

Python API

Privacy & Security

Supported LLM Providers

Bundled Datasets

Custom Datasets

From CSV

From JSON

Custom Schema

Metrics Explained

Exact Match

Structural Match

Execution Match

Output Formats

Contributing

Development Setup

Pull Request Process

License

Author

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages