t41372/Eval_752


Local-first LLM API evaluation platform — quantify model dumbing-down and pin a score on your LLM APIs


eval_752 is a local-first platform to test and compare LLM API providers on standard benchmarks and custom datasets.

It is built for teams who want a trustworthy answer to practical questions such as:

  • "Did this provider get worse?"
  • "Are two providers serving meaningfully different quality?"
  • "Can we reproduce this result later or share it with someone else?"
  • "Is the system path healthy before we trust a score?"

The default Docker path now starts an empty but production-real workspace: PostgreSQL, Redis, FastAPI, Celery, and the frontend come up ready for you to add a real provider, import a real dataset, and launch a real run. Provider credentials stay in the app database, and workspace-wide runtime tuning now lives in Settings.

Have you ever felt that the LLM you're talking to has gotten dumber? How do you know you're actually being served the model they claim? Could LLM API providers be quietly swapping in cheaper quants, reduced reasoning effort, smaller models, or entirely different models to cut costs?

Research shows that third-party LLM APIs frequently serve different models than advertised — with performance gaps up to 47% and identity verification failures in 46% of fingerprint tests (Zhang et al., 2026).

Anthropic has stated that they are developing measures to intentionally degrade model performance if they suspect you of distilling their model. How do we know we're not being targeted?

Perplexity was caught performing "Silent Model Substitution" — serving paid users with cheaper models instead of the premium model the user explicitly selected.

Well, they have the ability to lie. We should have the ability to check.

The right to independently evaluate the LLM services you pay for should belong to everyone.

License Notice

This project is licensed under the GNU Affero General Public License v3.0.

  • You may use, study, modify, and redistribute this software, including for commercial use.
  • If you distribute modified versions or derivative works, you must license them under AGPL v3 and preserve copyright and license notices.
  • If you run a modified version for users over a network, you must make the complete corresponding source code of that modified version available to those users.
  • This project is provided without warranty.

Read the full license text in LICENSE.

Why eval_752?

  • Audit LLM APIs before regressions hit production
  • Compare providers with identical prompts and scoring
  • Export reproducible .eval752.zip bundles for review and audit
  • Keep keys and evaluation data on your own infrastructure
  • Route any OpenAI-compatible API through a single operator workflow
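Since exports use the .eval752.zip extension, a bundle is presumably an ordinary zip archive, so it can be inspected with standard tools. A minimal stdlib sketch (the helper name is mine, and the bundle's internal layout is not assumed here):

```python
import zipfile

def list_bundle(path: str) -> list[str]:
    # List every file packed inside an .eval752.zip export bundle.
    with zipfile.ZipFile(path) as zf:
        return zf.namelist()
```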

What Works Today

The current alpha is strongest at the operator workflow:

  • connect providers and run smoke tests
  • import or build datasets
  • launch runs and watch the active runs board
  • compare completed runs
  • schedule recurring runs
  • export reproducible .eval752.zip bundles
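A provider smoke test ultimately boils down to one small chat-completion request. Assuming the OpenAI-compatible wire format the platform targets, the request body looks like this (the helper name is illustrative, not part of eval_752):

```python
import json

def smoke_test_body(model: str, prompt: str = "ping") -> str:
    # Minimal OpenAI-compatible /v1/chat/completions request body:
    # one user message, with a small completion cap to keep it cheap.
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 8,
    }
    return json.dumps(body)
```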

Features

  • Multiple LLM Providers — Unified access to OpenAI-compatible APIs via LiteLLM
  • Standard Benchmarks — Auto-import datasets from Hugging Face
  • Dataset Builder — No-code GUI for custom evaluation datasets
  • Real-time Progress — SSE-driven active run updates
  • Programmatic + LLM Scoring — Exact match, regex, or LLM-as-judge
  • Comparison Dashboard — Side-by-side metrics, charts, and section analysis
  • Scheduled Evaluations — RRULE-based recurring runs with timezone awareness
  • Export & Share — .eval752.zip bundles for reproducibility
  • Internationalization — English, Simplified Chinese (简体中文), Traditional Chinese (繁體中文)
  • Arena Mode — Pairwise comparisons with Elo/Bradley-Terry rankings (coming soon)
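The programmatic scoring modes listed above can be sketched in a few lines; these helpers are illustrative, not eval_752's actual scorer API:

```python
import re

def score_exact(expected: str, answer: str) -> float:
    # Exact match, ignoring surrounding whitespace and letter case.
    return 1.0 if expected.strip().lower() == answer.strip().lower() else 0.0

def score_regex(pattern: str, answer: str) -> float:
    # Full credit if the pattern matches anywhere in the answer.
    return 1.0 if re.search(pattern, answer) else 0.0
```

LLM-as-judge scoring replaces these pure functions with a second model call whose verdict is parsed onto the same 0–1 scale.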

Quick Start

Use the single-path Quick Start in docs/en/getting-started/quick-start.md. It boots the stack, then walks you through connecting a real provider, importing a dataset, and launching your first run.

Prerequisites: Docker and Docker Compose

Option A: Run the prebuilt GHCR stack

Example compose.yaml for a single-host GHCR deployment:

services:
  backend:
    image: ghcr.io/t41372/eval_752-backend:${EVAL752_IMAGE_TAG:-latest}
    env_file:
      - .env
    depends_on:
      - postgres
      - redis
    ports:
      - "8000:8000"

  celery-worker:
    image: ghcr.io/t41372/eval_752-celery:${EVAL752_IMAGE_TAG:-latest}
    env_file:
      - .env
    depends_on:
      - postgres
      - redis

  celery-beat:
    image: ghcr.io/t41372/eval_752-celery:${EVAL752_IMAGE_TAG:-latest}
    command: ["uv", "run", "celery", "-A", "eval_752.workers.app", "beat", "--loglevel=info"]
    env_file:
      - .env
    depends_on:
      - postgres
      - redis

  frontend:
    image: ghcr.io/t41372/eval_752-frontend:${EVAL752_IMAGE_TAG:-latest}
    depends_on:
      - backend
    ports:
      - "5173:5173"

  postgres:
    image: postgres:17
    env_file:
      - .env
    volumes:
      - pgdata:/var/lib/postgresql/data

  redis:
    image: redis:7.4-alpine
    volumes:
      - redisdata:/data

volumes:
  pgdata:
  redisdata:
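Every service reads its configuration from a shared .env file. A minimal sketch — EVAL752_IMAGE_TAG comes from the compose file above and the POSTGRES_* variables are the standard postgres image bootstrap settings, but treat .env.example as the authoritative list:

```shell
# Image tag used by the GHCR compose file (defaults to latest)
EVAL752_IMAGE_TAG=latest

# Standard postgres image bootstrap variables
POSTGRES_USER=eval752
POSTGRES_PASSWORD=change-me
POSTGRES_DB=eval752
```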

Use the repo-maintained file directly if you prefer:

git clone https://github.com/t41372/eval_752.git
cd eval_752
cp .env.example .env
docker compose -f docker-compose.ghcr.yml pull
docker compose -f docker-compose.ghcr.yml up -d

Option B: Build from source locally

cp .env.example .env
docker compose up --build -d

After the stack is online:

  1. Open http://localhost:5173
  2. Add a real provider from Providers
  3. Run a smoke test with the exact model you plan to evaluate
  4. Import a dataset from Hugging Face, upload .eval752.zip, or build one in the UI
  5. Launch a run from Runs
  6. Adjust workspace-wide timeout and retry policy from Settings when needed
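The retry policy in step 6 follows the usual shape of bounded attempts with exponential backoff. A generic sketch (the function and its defaults are illustrative, not eval_752's configuration surface):

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 0.5):
    # Call fn(); on failure, sleep base_delay * 2**attempt and retry,
    # re-raising the last error once all attempts are exhausted.
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```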

Complete guide: Quick Start

Documentation

Quick Start docs/en/getting-started/quick-start.md
简体中文 docs/zh/getting-started/quick-start.md
Configuration docs/en/operations/configuration.md
Monitoring docs/en/operations/monitoring.md
Troubleshooting docs/en/getting-started/troubleshooting.md

API docs are available at http://localhost:8000/docs and http://localhost:8000/redoc once the stack is running.

Roadmap

eval_752 is under active development. Planned work includes:

  • Arena Mode (pairwise + Elo rankings)
  • Robustness and anti-prefabrication signals
  • Browser Harness capture/import flows
  • LLM fingerprinting and model identity verification

Contributing

Read CONTRIBUTING.md before opening a pull request.

License

GNU AGPL v3.0
