
Overview

gaafa edited this page Apr 9, 2026 · 1 revision


EvalCI is a GitHub Action that runs your @eval_case suites on every pull request and blocks the merge if quality drops below your threshold.

No infrastructure. No backend. 2-minute setup.

→ Full docs: https://synapsekit.github.io/synapsekit-docs/docs/evalci/overview
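To make the "@eval_case suite" idea concrete, here is a hypothetical sketch of the shape of a case EvalCI would discover and run. synapsekit's real decorator API is not shown on this page, so a stand-in `eval_case` decorator is defined inline purely to keep the sketch self-contained — it is an illustration, not synapsekit's implementation.

```python
# Stand-in decorator, NOT synapsekit's real API: it only tags the function
# so a runner could find it. The real decorator likely does more.
def eval_case(fn):
    fn.is_eval_case = True  # hypothetical discovery marker
    return fn

@eval_case
def eval_rag_relevancy():
    # A real case would call your model and score the answer against a
    # reference; here a fixed score stands in to show the shape.
    return 0.85
```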


Why EvalCI

LLM applications degrade silently. A prompt change, a model update, a retrieval tweak — any of these can drop quality by 10–20% without a single test failure. EvalCI gives you a quality gate that catches this before it ships.

| Without EvalCI | With EvalCI |
|---|---|
| Quality regressions ship to production | Blocked at PR review |
| Manual eval runs, inconsistent | Automatic on every PR |
| No visibility into cost/latency trends | Score, cost, latency per case on every PR |
| Requires external tooling | Works in your existing GitHub Actions |

How it works

  1. pip install synapsekit[{extras}] on the runner
  2. synapsekit test {path} --format json --threshold {threshold} — discovers all @eval_case functions, runs them, outputs JSON
  3. Parse JSON results
  4. Post results table as a PR comment
  5. Set Action outputs: passed, failed, total, mean-score
  6. Exit 0 (all pass) or 1 (any failure)
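Steps 3, 5, and 6 can be sketched as a small parsing routine. This is a hypothetical sketch, assuming a JSON shape with a top-level `cases` list where each case carries a `score` — the field names are assumptions for illustration, not EvalCI's documented schema.

```python
import json

def summarize(results_json: str, threshold: float) -> dict:
    """Derive the Action outputs and exit code from assumed JSON results."""
    cases = json.loads(results_json)["cases"]  # "cases"/"score" keys assumed
    failed = [c for c in cases if c["score"] < threshold]
    scores = [c["score"] for c in cases]
    return {
        "passed": len(cases) - len(failed),
        "failed": len(failed),
        "total": len(cases),
        "mean-score": sum(scores) / len(scores) if scores else 0.0,
        "exit-code": 1 if failed else 0,  # any failure -> non-zero exit
    }
```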

PR comment

## EvalCI Results

|   | Test                   | Score | Cost    | Latency |
|---|------------------------|-------|---------|---------|
| ✅ | eval_rag_relevancy     | 0.850 | $0.0050 | 1200ms  |
| ❌ | eval_rag_faithfulness  | 0.650 | $0.0120 | 2500ms  |

**1/2 passed** · Threshold: `0.80`
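One way to produce a table row like the ones above is a per-case formatter. A minimal sketch, assuming hypothetical field names (`test`, `score`, `cost`, `latency_ms`) for the parsed JSON — these are illustrative, not EvalCI's actual schema:

```python
def render_row(case: dict, threshold: float) -> str:
    """Format one result as a markdown row matching the comment above."""
    status = "✅" if case["score"] >= threshold else "❌"
    return (f"| {status} | {case['test']} | {case['score']:.3f} "
            f"| ${case['cost']:.4f} | {case['latency_ms']}ms |")
```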

File discovery

EvalCI discovers files matching `eval_*.py` or `*_eval.py` recursively under `path`.

tests/
└── evals/
    ├── eval_rag.py        ✅ discovered
    ├── eval_agents.py     ✅ discovered
    ├── rag_eval.py        ✅ discovered
    └── test_rag.py        ❌ not discovered
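The discovery rule above amounts to a filename check at any depth. A minimal sketch of that check (the predicate name `is_eval_file` is mine, not EvalCI's):

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

def is_eval_file(path: str) -> bool:
    """True if the file's name matches eval_*.py or *_eval.py."""
    name = PurePosixPath(path).name  # only the filename matters, not the dir
    return fnmatch(name, "eval_*.py") or fnmatch(name, "*_eval.py")
```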

Next steps
