Local-first LLM API evaluation platform — expose quantization dumbing-down and settle the score on LLM quality.
eval_752 is a local-first platform to test and compare LLM API providers on standard benchmarks and custom datasets.
It is built for teams who want a trustworthy answer to practical questions such as:
- "Did this provider get worse?"
- "Are two providers serving meaningfully different quality?"
- "Can we reproduce this result later or share it with someone else?"
- "Is the system path healthy before we trust a score?"
The default Docker path now starts an empty but production-real workspace: PostgreSQL, Redis,
FastAPI, Celery, and the frontend come up ready for you to add a real provider, import a real
dataset, and launch a real run. Provider credentials stay in the app database, and workspace-wide
runtime tuning now lives in Settings.
Have you ever felt that the LLM you're talking to has gotten dumber? How do you know you're actually being served the model they claim? Could LLM API providers be quietly swapping in cheaper quants, reduced reasoning effort, smaller models, or entirely different models to cut costs?
Research shows that third-party LLM APIs frequently serve different models than advertised — with performance gaps up to 47% and identity verification failures in 46% of fingerprint tests (Zhang et al., 2026).
Anthropic has stated that they are developing measures to intentionally degrade model performance if they suspect you of distilling their model. How do we know we're not being targeted?
Perplexity was caught performing "Silent Model Substitution" — serving paid users with cheaper models instead of the premium model the user explicitly selected.
Well, they have the ability to lie. We should have the ability to check.
The right to independently evaluate the LLM services you pay for should belong to everyone.
This project is licensed under the GNU Affero General Public License v3.0.
- You may use, study, modify, and redistribute this software, including for commercial use.
- If you distribute modified versions or derivative works, you must license them under AGPL v3 and preserve copyright and license notices.
- If you run a modified version for users over a network, you must make the complete corresponding source code of that modified version available to those users.
- This project is provided without warranty.
Read the full license text in LICENSE.
- Audit LLM APIs before regressions hit production
- Compare providers with identical prompts and scoring
- Export reproducible `.eval752.zip` bundles for review and audit
- Keep keys and evaluation data on your own infrastructure
- Route any OpenAI-compatible API through a single operator workflow
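"OpenAI-compatible" routing works because every such provider accepts the same chat-completions request shape, so one payload builder can serve them all. A minimal sketch (the helper name and model name are illustrative, not part of eval_752's API):

```python
import json

def chat_payload(model: str, prompt: str, temperature: float = 0.0) -> dict:
    """Build the standard OpenAI-compatible /v1/chat/completions body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

# The same body can be POSTed to any compatible provider's
# <base URL>/chat/completions endpoint.
body = json.dumps(chat_payload("gpt-4o-mini", "Reply with OK."))
```

Because the request shape is shared, switching providers for an identical evaluation is a base-URL change, not a code change.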
The current alpha is strongest at the operator workflow:
- connect providers and run smoke tests
- import or build datasets
- launch runs and watch the active runs board
- compare completed runs
- schedule recurring runs
- export reproducible `.eval752.zip` bundles
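Recurring runs are scheduled with RFC 5545 RRULE strings. A hand-rolled sketch of what a weekly rule resolves to (eval_752's real scheduler is not shown here; `next_weekly` is a hypothetical helper for illustration only):

```python
from datetime import datetime, timedelta, timezone

# Every Monday at 02:00 UTC, expressed as an RFC 5545 recurrence rule:
rrule = "FREQ=WEEKLY;BYDAY=MO;BYHOUR=2;BYMINUTE=0"

def next_weekly(now: datetime, weekday: int = 0, hour: int = 2) -> datetime:
    """Next occurrence of a weekly rule (weekday 0 = Monday)."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    if candidate <= now:  # this week's slot already passed -> next week
        candidate += timedelta(days=7)
    return candidate
```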
- Multiple LLM Providers — Unified access to OpenAI-compatible APIs via LiteLLM
- Standard Benchmarks — Auto-import datasets from Hugging Face
- Dataset Builder — No-code GUI for custom evaluation datasets
- Real-time Progress — SSE-driven active run updates
- Programmatic + LLM Scoring — Exact match, regex, or LLM-as-judge
- Comparison Dashboard — Side-by-side metrics, charts, and section analysis
- Scheduled Evaluations — RRULE-based recurring runs with timezone awareness
- Export & Share — `.eval752.zip` bundles for reproducibility
- Internationalization — English, 简体中文, 繁體中文
- Arena Mode — Pairwise comparisons with Elo/Bradley-Terry rankings (coming soon)
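The two programmatic scorer styles named above can be sketched in a few lines (function names are illustrative, not eval_752's actual API):

```python
import re

def score_exact(expected: str, answer: str) -> float:
    """1.0 if the normalized answer equals the reference, else 0.0."""
    return float(expected.strip().lower() == answer.strip().lower())

def score_regex(pattern: str, answer: str) -> float:
    """1.0 if the pattern matches anywhere in the answer, else 0.0."""
    return float(re.search(pattern, answer) is not None)
```

LLM-as-judge replaces these deterministic checks with a grading prompt sent to a trusted model, which is why the choice of judge model matters when comparing runs.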
Use the single-path Quick Start in `docs/en/getting-started/quick-start.md`.
It boots the stack, has you connect a real provider, import a dataset, and launch the first run.
Prerequisites: Docker and Docker Compose
Example compose.yaml for a single-host GHCR deployment:

```yaml
services:
  backend:
    image: ghcr.io/t41372/eval_752-backend:${EVAL752_IMAGE_TAG:-latest}
    env_file:
      - .env
    depends_on:
      - postgres
      - redis
    ports:
      - "8000:8000"
  celery-worker:
    image: ghcr.io/t41372/eval_752-celery:${EVAL752_IMAGE_TAG:-latest}
    env_file:
      - .env
    depends_on:
      - postgres
      - redis
  celery-beat:
    image: ghcr.io/t41372/eval_752-celery:${EVAL752_IMAGE_TAG:-latest}
    command: ["uv", "run", "celery", "-A", "eval_752.workers.app", "beat", "--loglevel=info"]
    env_file:
      - .env
    depends_on:
      - postgres
      - redis
  frontend:
    image: ghcr.io/t41372/eval_752-frontend:${EVAL752_IMAGE_TAG:-latest}
    depends_on:
      - backend
    ports:
      - "5173:5173"
  postgres:
    image: postgres:17
    env_file:
      - .env
    volumes:
      - pgdata:/var/lib/postgresql/data
  redis:
    image: redis:7.4-alpine
    volumes:
      - redisdata:/data

volumes:
  pgdata:
  redisdata:
```

Use the repo-maintained file directly if you prefer:
```shell
git clone https://github.com/t41372/eval_752.git
cd eval_752
cp .env.example .env
docker compose -f docker-compose.ghcr.yml pull
docker compose -f docker-compose.ghcr.yml up -d
```

Or build the images from source:

```shell
cp .env.example .env
docker compose up --build -d
```

After the stack is online:
- Open http://localhost:5173
- Add a real provider from `Providers`
- Run a smoke test with the exact model you plan to evaluate
- Import a dataset from Hugging Face, upload `.eval752.zip`, or build one in the UI
- Launch a run from `Runs`
- Adjust workspace-wide timeout and retry policy from `Settings` when needed
Complete guide: Quick Start
| Document | Path |
| --- | --- |
| Quick Start | docs/en/getting-started/quick-start.md |
| Quick Start (简体中文) | docs/zh/getting-started/quick-start.md |
| Configuration | docs/en/operations/configuration.md |
| Monitoring | docs/en/operations/monitoring.md |
| Troubleshooting | docs/en/getting-started/troubleshooting.md |
API docs are available at http://localhost:8000/docs and http://localhost:8000/redoc once the
stack is running.
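FastAPI also serves the machine-readable schema at `/openapi.json` by default, so a quick liveness probe for the backend can fetch it. A sketch (the base URL is assumed from the port mapping above; `api_is_up` is a hypothetical helper, not part of eval_752):

```python
import json
import urllib.request

def api_is_up(base: str = "http://localhost:8000") -> bool:
    """Return True if the backend answers with a parseable OpenAPI schema."""
    try:
        with urllib.request.urlopen(f"{base}/openapi.json", timeout=5) as resp:
            return "paths" in json.load(resp)
    except (OSError, ValueError):
        return False
```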
eval_752 is under active development. Planned work includes:
- Arena Mode (pairwise + Elo rankings)
- Robustness and anti-prefabrication signals
- Browser Harness capture/import flows
- LLM fingerprinting and model identity verification
Read CONTRIBUTING.md before opening a pull request.
