Open-source evaluation framework for Text-to-SQL systems
Measure how good your Text-to-SQL actually is. Compare LLMs. Find failure patterns.
Most Text-to-SQL demos show cherry-picked examples. sql-eval tells you the truth:
- Exact accuracy — Does the generated SQL match the expected query?
- Structural accuracy — Are the tables, columns, and JOINs correct?
- Execution accuracy — Do both queries return the same results?
- Failure analysis — Where exactly does your system break?
✅ 100% Local Option — Your schema and data never leave your machine (with Ollama)
✅ Multiple LLMs — OpenAI, Anthropic Claude, Ollama, SQLCoder
✅ Bundled Datasets — Start testing in 30 seconds
✅ Custom Datasets — Bring your own schema and questions
✅ Detailed Metrics — Exact match, structural match, execution match
✅ Failure Analysis — Know exactly where your system breaks
pip install sql-eval
# With OpenAI support
pip install sql-eval[openai]
# With all providers
pip install sql-eval[all]# Run with bundled e-commerce dataset
sql-eval run --dataset ecommerce --llm openai
# Use Claude
sql-eval run --dataset ecommerce --llm anthropic
# Fully offline with Ollama (no data leaves your machine)
sql-eval run --dataset ecommerce --llm ollama --model codellama
# Compare multiple LLMs
sql-eval compare --dataset ecommerce --llms openai,anthropic,ollama
# List available datasets
sql-eval listfrom sql_eval import Evaluator
from sql_eval.llm_providers import OpenAIProvider
from sql_eval.datasets import load_ecommerce
# Load dataset
test_cases, schema, db = load_ecommerce(with_db=True)
# Initialize evaluator
provider = OpenAIProvider() # Uses OPENAI_API_KEY env var
evaluator = Evaluator(
llm_provider=provider,
schema=schema,
db_connector=db # Optional: for execution tests
)
# Run evaluation
report = evaluator.evaluate(test_cases)
# View results
print(report.get_summary())
print(f"Exact Match: {report.exact_match_accuracy:.1%}")
print(f"Structural Match: {report.structural_accuracy:.1%}")
# Analyze failures
for pattern in report.common_failure_patterns[:5]:
print(f"- {pattern.pattern_name}: {pattern.count} cases")Your data stays on your machine.
- Schema-only mode: No actual data needed, just table definitions
- Local execution: Use Ollama for 100% offline evaluation
- No telemetry: We don't collect any usage data
# Fully offline evaluation
ollama pull codellama
sql-eval run --dataset ecommerce --llm ollama --model codellama| Provider | Models | Local? | Setup |
|---|---|---|---|
| OpenAI | GPT-4o, GPT-4o-mini | ❌ | export OPENAI_API_KEY=... |
| Anthropic | Claude Sonnet, Claude Haiku | ❌ | export ANTHROPIC_API_KEY=... |
| Ollama | CodeLlama, Llama3, Mistral | ✅ | ollama serve |
| SQLCoder | sqlcoder | ✅ | ollama pull sqlcoder |
| Dataset | Questions | Tables | Difficulty |
|---|---|---|---|
| ecommerce | 35 | 5 | Easy to Hard |
More datasets coming soon (SaaS metrics, HR database).
from sql_eval.core.evaluator import DatasetLoader
test_cases = DatasetLoader.from_csv(
'my_questions.csv',
question_col='question',
sql_col='sql',
difficulty_col='difficulty'
)test_cases = DatasetLoader.from_json('my_questions.json')Expected JSON format:
[
{
"question_id": "Q001",
"question": "How many users signed up last month?",
"sql": "SELECT COUNT(*) FROM users WHERE created_at >= DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')",
"difficulty": "medium",
"category": "date_functions"
}
]from sql_eval.core.schema_loader import SchemaLoader
# From SQL file
schema = SchemaLoader.from_sql_file('my_schema.sql')
# From DDL string
schema = SchemaLoader.from_ddl("""
CREATE TABLE users (
id INTEGER PRIMARY KEY,
email VARCHAR(255) NOT NULL,
created_at TIMESTAMP
);
""")Generated SQL matches expected SQL after normalization (whitespace, case).
Compares SQL components independently:
- Tables referenced
- Columns selected
- JOIN conditions
- WHERE clauses
- GROUP BY / ORDER BY
- Aggregation functions
Both queries return identical results when run against the database.
# Console (default)
sql-eval run --dataset ecommerce --llm openai
# JSON report
sql-eval run --dataset ecommerce --llm openai --output json
# HTML report
sql-eval run --dataset ecommerce --llm openai --output htmlContributions welcome! Areas where help is needed:
- More bundled datasets (SaaS metrics, HR database, etc.)
- Additional LLM providers (Google Gemini, Cohere, etc.)
- Improved SQL comparison logic
- Documentation improvements
- More test coverage
# Clone the repo
git clone https://github.com/ajaymaurya/sql-eval.git
cd sql-eval
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install in development mode
pip install -e .[dev]
# Run tests
pytest tests/ -v
# Run linting
ruff check .- Fork the repo and create your branch from
main - Add tests for any new functionality
- Ensure all tests pass (
pytest tests/ -v) - Update documentation if needed
- Submit a PR with a clear description
MIT License - see LICENSE for details.
Ajay Maurya - AI Engineer LinkedIn | GitHub
Built with ❤️ for the Text-to-SQL community