LLM quality gates for every PR. Run your `@eval_case` suites automatically and block merge if quality drops below a threshold.
- Zero infrastructure — runs entirely in GitHub Actions
- 2-minute setup
- Works with any LLM provider (OpenAI, Anthropic, Gemini, and 30+ more)
- Posts a formatted results table as a PR comment
- Sets Action outputs for downstream steps
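The gate itself amounts to comparing each case's score against the threshold; a toy sketch of the idea (not EvalCI's actual code):

```python
def quality_gate(scores, threshold=0.80):
    """Toy merge gate: pass only if every eval case meets the threshold.

    Returns (passed, failures) where failures maps case name -> score.
    """
    failures = {name: s for name, s in scores.items() if s < threshold}
    return len(failures) == 0, failures

# quality_gate({"relevancy": 0.85, "faithfulness": 0.65})
# -> (False, {"faithfulness": 0.65})
```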
Add `.github/workflows/eval.yml` to your repo:
```yaml
name: EvalCI
on:
  pull_request:
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: SynapseKit/evalci@v1
        with:
          path: tests/evals
          threshold: "0.80"
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

That's it. EvalCI will:
- Install `synapsekit` into the runner
- Discover and run all `@eval_case`-decorated functions under `tests/evals/`
- Post a results table as a PR comment
- Fail the check if any case scores below the threshold
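Decorator-based discovery like this usually works via a module-level registry that the decorator appends to; a minimal sketch of the pattern (not SynapseKit's actual implementation):

```python
_REGISTRY = []

def eval_case(min_score=0.7, **limits):
    """Toy stand-in for @eval_case: records each function and its thresholds."""
    def decorate(fn):
        _REGISTRY.append({"fn": fn, "min_score": min_score, "limits": limits})
        return fn
    return decorate

@eval_case(min_score=0.80)
def test_always_passes():
    return 0.9  # a pretend score

def run_all():
    """Run every registered case and report (name, score, passed)."""
    return [
        (case["fn"].__name__, score, score >= case["min_score"])
        for case in _REGISTRY
        if (score := case["fn"]()) is not None
    ]
```

A runner can then import the files under `tests/evals/`, which populates the registry as a side effect of the decorators executing.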
```python
# tests/evals/test_rag.py
from synapsekit.testing import eval_case

@eval_case(min_score=0.80, max_cost_usd=0.01, max_latency_ms=3000)
def test_rag_relevancy(eval_context):
    result = my_rag_pipeline("What is SynapseKit?")
    return eval_context.score_relevancy(result, reference="SynapseKit is a Python library...")

@eval_case(min_score=0.75)
def test_rag_faithfulness(eval_context):
    result = my_rag_pipeline("How do I install SynapseKit?")
    return eval_context.score_faithfulness(result, context=retrieved_docs)
```

EvalCI posts a comment like this on every PR:
| Test | Score | Cost | Latency |
|---|---|---|---|
| ✅ test_rag_relevancy | 0.850 | $0.0050 | 1200ms |
| ❌ test_rag_faithfulness | 0.650 | $0.0120 | 2500ms |

1/2 passed · Threshold: 0.80 · SynapseKit EvalCI
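The footer of the comment ("1/2 passed", mean score) is derived directly from the per-case rows; a sketch using the numbers from the table above:

```python
def summarize(rows, threshold=0.80):
    """Compute the pass count and mean score shown in the comment footer."""
    passed = sum(1 for _, score in rows if score >= threshold)
    mean = sum(score for _, score in rows) / len(rows)
    return f"{passed}/{len(rows)} passed", round(mean, 3)

rows = [("test_rag_relevancy", 0.850), ("test_rag_faithfulness", 0.650)]
# summarize(rows) -> ("1/2 passed", 0.75)
```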
| Input | Description | Default |
|---|---|---|
| `path` | Path to eval files or directory | `.` |
| `threshold` | Global minimum score (0.0–1.0) | `0.7` |
| `extras` | pip extras for synapsekit (e.g. `openai,anthropic`) | `openai` |
| `synapsekit-version` | synapsekit version to install, or `latest` | `latest` |
| `github-token` | Token for posting PR comments | `${{ github.token }}` |
| `fail-on-regression` | Fail if score regresses vs. baseline | `false` |
| `token` | EvalCI backend API token (future) | — |
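How the global `threshold` input interacts with a per-case `min_score` (as in the example above, where both appear) is not spelled out here; one plausible resolution rule, assumed for illustration only, is that a case-level `min_score` overrides the global value when set:

```python
def effective_threshold(global_threshold=0.7, min_score=None):
    """Assumed precedence: a case-level min_score wins over the global threshold."""
    return min_score if min_score is not None else global_threshold

# effective_threshold(0.80)                  -> 0.80 (global applies)
# effective_threshold(0.80, min_score=0.75)  -> 0.75 (per-case wins)
```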
| Output | Description |
|---|---|
| `passed` | Number of eval cases that passed |
| `failed` | Number of eval cases that failed |
| `total` | Total number of eval cases run |
| `mean-score` | Mean score across all eval cases |
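Action outputs like these are published by appending `name=value` lines to the file named in the `GITHUB_OUTPUT` environment variable (the standard GitHub Actions mechanism); a sketch of how a step could emit them, not EvalCI's actual code:

```python
import os

def set_outputs(results, output_file=None):
    """Append GitHub Actions output lines (name=value) for downstream steps.

    `results` is a list of dicts with "ok" (bool) and "score" (float) keys.
    """
    output_file = output_file or os.environ.get("GITHUB_OUTPUT", "github_output.txt")
    passed = sum(1 for r in results if r["ok"])
    lines = [
        f"passed={passed}",
        f"failed={len(results) - passed}",
        f"total={len(results)}",
        f"mean-score={sum(r['score'] for r in results) / len(results):.3f}",
    ]
    with open(output_file, "a") as f:
        f.write("\n".join(lines) + "\n")
    return lines
```

Downstream steps then read them as `${{ steps.<id>.outputs.passed }}` and so on, as shown in the usage example below.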
```yaml
- uses: SynapseKit/evalci@v1
  id: eval
  with:
    path: tests/evals
- run: |
    echo "Passed: ${{ steps.eval.outputs.passed }}/${{ steps.eval.outputs.total }}"
    echo "Mean score: ${{ steps.eval.outputs.mean-score }}"
```

```yaml
- uses: SynapseKit/evalci@v1
  with:
    extras: "openai,anthropic"
    threshold: "0.75"
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```

Full documentation is available at synapsekit.github.io/synapsekit-docs/docs/evalci/overview
| Page | Description |
|---|---|
| Overview | What EvalCI is and how it works |
| Quickstart | Set up in 5 minutes |
| Writing eval cases | How to write `@eval_case` functions |
| Action reference | All inputs, outputs, and configuration |
| Examples | RAG, agents, multi-provider workflows |
EvalCI is built on SynapseKit — a Python library for building LLM applications with 30+ provider integrations and a built-in evaluation framework.