cherry-evals

Search, cherry-pick, and export examples from public AI evaluation datasets.

What is this?

Cherry Evals is a platform for building custom evaluation collections from public AI benchmark datasets (MMLU, HumanEval, GSM8K, etc.). Instead of writing one-off scripts to filter and convert datasets, you can use Cherry Evals to:

Search across multiple datasets with keyword, semantic, and hybrid search
Cherry-pick individual examples into curated collections
Export collections in the format your eval framework expects (Langfuse, LangSmith, Inspect AI, JSONL, CSV)

Works for both humans (web UI, CLI) and AI agents (MCP server, REST API).
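To make the JSONL export target concrete, here is a minimal sketch of what writing a collection as JSON Lines looks like in general: one JSON object per line. The `to_jsonl` helper and the field names (`dataset`, `question`, `answer`) are hypothetical and chosen for illustration only. They are not the schema Cherry Evals actually emits.

```python
import json
from typing import Iterable


def to_jsonl(examples: Iterable[dict]) -> str:
    """Serialize examples as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(ex, ensure_ascii=False) + "\n" for ex in examples)


# Hypothetical collection; these field names are illustrative only
# and are not taken from the Cherry Evals export schema.
collection = [
    {"dataset": "mmlu", "question": "What is 2 + 2?", "answer": "4"},
    {"dataset": "gsm8k", "question": "What is 3 * 5?", "answer": "15"},
]

jsonl = to_jsonl(collection)
print(jsonl, end="")
```

Each line parses on its own with `json.loads`, which is why the format suits streaming large collections into an eval framework.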

Quick Start

Use the Cloud API (no setup)

The API is live at https://cherry-evals-api-480090132755.europe-north1.run.app with 10 datasets and 151K examples ready to search.

# Search for biology questions
curl -X POST https://cherry-evals-api-480090132755.europe-north1.run.app/search \
  -H "Content-Type: application/json" \
  -d '{"query": "photosynthesis", "limit": 5}'

# List all datasets
curl https://cherry-evals-api-480090132755.europe-north1.run.app/datasets
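The same two calls can be made from Python with only the standard library. This is a sketch, not an official client: the helper names (`build_search_request`, `fetch_json`, `main`) are made up for this example, and since the response schema is not documented here, the code just prints whatever JSON comes back.

```python
import json
import urllib.request

BASE_URL = "https://cherry-evals-api-480090132755.europe-north1.run.app"


def build_search_request(query: str, limit: int = 5,
                         base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /search request, mirroring the curl example."""
    body = json.dumps({"query": query, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_json(request, timeout: float = 30.0):
    """Send a request (or plain URL string) and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def main() -> None:
    # Search for biology questions (performs a network call).
    results = fetch_json(build_search_request("photosynthesis", limit=5))
    print(json.dumps(results, indent=2))

    # List all datasets (plain GET).
    datasets = fetch_json(f"{BASE_URL}/datasets")
    print(json.dumps(datasets, indent=2))
```

Call `main()` to run both requests. Passing a different `base_url` to `build_search_request` should let the same helper target a self-hosted instance, assuming it exposes the same `/search` route.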

Self-hosted

# Clone and install
git clone https://github.com/marinone94/cherry-evals.git
cd cherry-evals
uv sync

# Start infrastructure
docker compose up -d

# Run migrations
uv run alembic upgrade head

# Ingest a dataset
uv run python -m cherry_evals.cli ingest mmlu

# Generate embeddings
uv run python -m cherry_evals.cli embed mmlu

# Start the API
uv run fastapi dev api/main.py

Interfaces

Interface	For	Status
REST API	Programmatic access	Available
CLI	Local operations	Available
MCP Server	AI agent integration	Available
Web UI	Visual browsing	Available

Project Structure

cherry-evals/
├── api/                # FastAPI REST API
├── cherry_evals/       # Core package (CLI, ingestion, embeddings)
├── core/               # Business logic (search, convert, export)
├── db/                 # Database layer (PostgreSQL, Qdrant)
├── agents/             # LLM-powered features
├── tests/              # Test suite
└── docs/               # Architecture and vision docs

Development

uv run pytest                  # Run tests
uv run ruff check .            # Lint
uv run ruff format .           # Format
uv run pre-commit run --all-files  # All checks

See AGENTS.md for the full development guide. See ROADMAP.md for the development roadmap.

Tech Stack

License

Elastic License 2.0 — free to use and modify, but you may not offer it as a competing hosted service.
