Search, cherry-pick, and export examples from public AI evaluation datasets.
Cherry Evals is a platform for discovering and curating custom evaluation collections from public AI benchmark datasets (MMLU, HumanEval, GSM8K, etc.). Instead of writing one-off scripts to filter and convert datasets, use Cherry Evals to:
- Search across multiple datasets with keyword, semantic, and hybrid search
- Cherry-pick individual examples into curated collections
- Export collections to any eval framework format (Langfuse, LangSmith, Inspect AI, JSONL, CSV)
Works for both humans (web UI, CLI) and AI agents (MCP server, REST API).
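To make the JSONL and CSV export targets listed above concrete, here is a minimal sketch of what serializing a small curated collection could look like. It is not the Cherry Evals exporter: the record fields (`id`, `dataset`, `question`, `answer`), the sample rows, and the helper names `to_jsonl` and `to_csv` are all assumptions made for illustration.

```python
import csv
import io
import json

# Hypothetical curated collection; field names and rows are illustrative only
# and do not reflect the actual Cherry Evals schema.
collection = [
    {"id": "mmlu-001", "dataset": "MMLU",
     "question": "Which pigment absorbs light in photosynthesis?",
     "answer": "Chlorophyll"},
    {"id": "gsm8k-042", "dataset": "GSM8K",
     "question": "What is 12 * 7?",
     "answer": "84"},
]


def to_jsonl(examples):
    """Serialize each example as one JSON object per line."""
    return "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples) + "\n"


def to_csv(examples):
    """Serialize examples as CSV, using the first example's keys as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(examples[0].keys()))
    writer.writeheader()
    writer.writerows(examples)
    return buffer.getvalue()


jsonl_text = to_jsonl(collection)
csv_text = to_csv(collection)
```

JSONL keeps nested or variable fields intact, while CSV assumes every example shares the same flat set of keys, which is why the sketch takes the header from the first record.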
The API is live at https://cherry-evals-api-480090132755.europe-north1.run.app with 10 datasets and 151K examples ready to search.
# Search for biology questions
curl -X POST https://cherry-evals-api-480090132755.europe-north1.run.app/search \
-H "Content-Type: application/json" \
-d '{"query": "photosynthesis", "limit": 5}'
# List all datasets
curl https://cherry-evals-api-480090132755.europe-north1.run.app/datasets

# Clone and install
git clone https://github.com/marinone94/cherry-evals.git
cd cherry-evals
uv sync
# Start infrastructure
docker compose up -d
# Run migrations
uv run alembic upgrade head
# Ingest a dataset
uv run python -m cherry_evals.cli ingest mmlu
# Generate embeddings
uv run python -m cherry_evals.cli embed mmlu
# Start the API
uv run fastapi dev api/main.py

| Interface | For | Status |
|---|---|---|
| REST API | Programmatic access | Available |
| CLI | Local operations | Available |
| MCP Server | AI agent integration | Available |
| Web UI | Visual browsing | Available |
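
As an illustration of programmatic access through the REST API, the sketch below builds the same `POST /search` request as the curl example, using only the Python standard library. The helper names `build_search_request` and `send` are hypothetical, and the response schema is not described in this README, so the sketch returns the parsed JSON without assuming its shape.

```python
import json
import urllib.request

# Base URL of the live API as given in this README.
BASE_URL = "https://cherry-evals-api-480090132755.europe-north1.run.app"


def build_search_request(query, limit=5):
    """Build (but do not send) a POST /search request with a JSON body."""
    payload = json.dumps({"query": query, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/search",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send a request and decode the JSON response body.

    The response schema is an unknown here, so the parsed value is
    returned as-is rather than unpacked into specific fields.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_search_request("photosynthesis", limit=5)
# results = send(request)  # uncomment to call the live API
```

Separating request construction from sending keeps the example inspectable offline; the network call happens only when `send` is invoked.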
cherry-evals/
├── api/ # FastAPI REST API
├── cherry_evals/ # Core package (CLI, ingestion, embeddings)
├── core/ # Business logic (search, convert, export)
├── db/ # Database layer (PostgreSQL, Qdrant)
├── agents/ # LLM-powered features
├── tests/ # Test suite
└── docs/ # Architecture and vision docs
uv run pytest # Run tests
uv run ruff check . # Lint
uv run ruff format . # Format
uv run pre-commit run --all # All checks

See AGENTS.md for the full development guide. See ROADMAP.md for the development roadmap.
Python 3.13 | FastAPI | PostgreSQL | Qdrant | Google Embeddings | Anthropic Claude | Langfuse
Elastic License 2.0 — free to use and modify, but you may not offer it as a competing hosted service.