t41372/Eval_752


Local-first LLM API evaluation platform — quantify model dumbing-down and pin a score on your LLM APIs


eval_752 is a local-first platform to test and compare LLM API providers on standard benchmarks and custom datasets.

It is built for teams who want a trustworthy answer to practical questions such as:

  • "Did this provider get worse?"
  • "Are two providers serving meaningfully different quality?"
  • "Can we reproduce this result later or share it with someone else?"
  • "Is the system path healthy before we trust a score?"

The default Docker path now starts an empty but production-real workspace: PostgreSQL, Redis, FastAPI, Celery, and the frontend come up ready for you to add a real provider, import a real dataset, and launch a real run. Provider credentials stay in the app database, and workspace-wide runtime tuning now lives in Settings.

Have you ever felt that the LLM you're talking to has gotten dumber? How do you know you're actually being served the model they claim? Could LLM API providers be quietly swapping in cheaper quants, reduced reasoning effort, smaller models, or entirely different models to cut costs?

Research shows that third-party LLM APIs frequently serve different models than advertised — with performance gaps up to 47% and identity verification failures in 46% of fingerprint tests (Zhang et al., 2026).

Anthropic has stated that they are developing measures to intentionally degrade model performance if they suspect you of distilling their model. How do we know we're not being targeted?

Perplexity was caught performing "Silent Model Substitution" — serving paid users with cheaper models instead of the premium model the user explicitly selected.

Well, they have the ability to lie. We should have the ability to check.

The right to independently evaluate the LLM services you pay for should belong to everyone.

License Notice

This project is licensed under the GNU Affero General Public License v3.0.

  • You may use, study, modify, and redistribute this software, including for commercial use.
  • If you distribute modified versions or derivative works, you must license them under AGPL v3 and preserve copyright and license notices.
  • If you run a modified version for users over a network, you must make the complete corresponding source code of that modified version available to those users.
  • This project is provided without warranty.

Read the full license text in LICENSE.

Why eval_752?

  • Audit LLM APIs before regressions hit production
  • Compare providers with identical prompts and scoring
  • Export reproducible .eval752.zip bundles for review and audit
  • Keep keys and evaluation data on your own infrastructure
  • Route any OpenAI-compatible API through a single operator workflow
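Since exports use the .eval752.zip extension, a bundle is presumably an ordinary zip archive, so it can be inspected with standard tools. A minimal stdlib sketch (the helper name is mine, and the bundle's internal layout is not assumed here):

```python
import zipfile

def list_bundle(path: str) -> list[str]:
    # List every file packed inside an .eval752.zip export bundle.
    with zipfile.ZipFile(path) as zf:
        return zf.namelist()
```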

What Works Today

The current alpha is strongest at the operator workflow:

  • connect providers and run smoke tests
  • import or build datasets
  • launch runs and watch the active runs board
  • compare completed runs
  • schedule recurring runs
  • export reproducible .eval752.zip bundles
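A provider smoke test ultimately boils down to one small chat-completion request. Assuming the OpenAI-compatible wire format the platform targets, the request body looks like this (the helper name is illustrative, not part of eval_752):

```python
import json

def smoke_test_body(model: str, prompt: str = "ping") -> str:
    # Minimal OpenAI-compatible /v1/chat/completions request body:
    # one user message, with a small completion cap to keep it cheap.
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 8,
    }
    return json.dumps(body)
```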

Features

  • Multiple LLM Providers — Unified access to OpenAI-compatible APIs via LiteLLM
  • Standard Benchmarks — Auto-import datasets from Hugging Face
  • Dataset Builder — No-code GUI for custom evaluation datasets
  • Real-time Progress — SSE-driven active run updates
  • Programmatic + LLM Scoring — Exact match, regex, or LLM-as-judge
  • Comparison Dashboard — Side-by-side metrics, charts, and section analysis
  • Scheduled Evaluations — RRULE-based recurring runs with timezone awareness
  • Export & Share — .eval752.zip bundles for reproducibility
  • Internationalization — English, Simplified Chinese (简体中文), Traditional Chinese (繁體中文)
  • Arena Mode — Pairwise comparisons with Elo/Bradley-Terry rankings (coming soon)
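The programmatic scoring modes listed above can be sketched in a few lines; these helpers are illustrative, not eval_752's actual scorer API:

```python
import re

def score_exact(expected: str, answer: str) -> float:
    # Exact match, ignoring surrounding whitespace and letter case.
    return 1.0 if expected.strip().lower() == answer.strip().lower() else 0.0

def score_regex(pattern: str, answer: str) -> float:
    # Full credit if the pattern matches anywhere in the answer.
    return 1.0 if re.search(pattern, answer) else 0.0
```

LLM-as-judge scoring replaces these pure functions with a second model call whose verdict is parsed onto the same 0–1 scale.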

Quick Start

Use the single-path Quick Start in docs/en/getting-started/quick-start.md. It boots the stack, then walks you through connecting a real provider, importing a dataset, and launching your first run.

Prerequisites: Docker and Docker Compose

Option A: Run the prebuilt GHCR stack

Example compose.yaml for a single-host GHCR deployment:

services:
  backend:
    image: ghcr.io/t41372/eval_752-backend:${EVAL752_IMAGE_TAG:-latest}
    env_file:
      - .env
    depends_on:
      - postgres
      - redis
    ports:
      - "8000:8000"

  celery-worker:
    image: ghcr.io/t41372/eval_752-celery:${EVAL752_IMAGE_TAG:-latest}
    env_file:
      - .env
    depends_on:
      - postgres
      - redis

  celery-beat:
    image: ghcr.io/t41372/eval_752-celery:${EVAL752_IMAGE_TAG:-latest}
    command: ["uv", "run", "celery", "-A", "eval_752.workers.app", "beat", "--loglevel=info"]
    env_file:
      - .env
    depends_on:
      - postgres
      - redis

  frontend:
    image: ghcr.io/t41372/eval_752-frontend:${EVAL752_IMAGE_TAG:-latest}
    depends_on:
      - backend
    ports:
      - "5173:5173"

  postgres:
    image: postgres:17
    env_file:
      - .env
    volumes:
      - pgdata:/var/lib/postgresql/data

  redis:
    image: redis:7.4-alpine
    volumes:
      - redisdata:/data

volumes:
  pgdata:
  redisdata:
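Every service reads its configuration from a shared .env file. A minimal sketch — EVAL752_IMAGE_TAG comes from the compose file above and the POSTGRES_* variables are the standard postgres image bootstrap settings, but treat .env.example as the authoritative list:

```shell
# Image tag used by the GHCR compose file (defaults to latest)
EVAL752_IMAGE_TAG=latest

# Standard postgres image bootstrap variables
POSTGRES_USER=eval752
POSTGRES_PASSWORD=change-me
POSTGRES_DB=eval752
```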

Use the repo-maintained file directly if you prefer:

git clone https://github.com/t41372/eval_752.git
cd eval_752
cp .env.example .env
docker compose -f docker-compose.ghcr.yml pull
docker compose -f docker-compose.ghcr.yml up -d

Option B: Build from source locally

cp .env.example .env
docker compose up --build -d

After the stack is online:

  1. Open http://localhost:5173
  2. Add a real provider from Providers
  3. Run a smoke test with the exact model you plan to evaluate
  4. Import a dataset from Hugging Face, upload .eval752.zip, or build one in the UI
  5. Launch a run from Runs
  6. Adjust workspace-wide timeout and retry policy from Settings when needed
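The retry policy in step 6 follows the usual shape of bounded attempts with exponential backoff. A generic sketch (the function and its defaults are illustrative, not eval_752's configuration surface):

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 0.5):
    # Call fn(); on failure, sleep base_delay * 2**attempt and retry,
    # re-raising the last error once all attempts are exhausted.
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```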

Complete guide: Quick Start

Documentation

Quick Start docs/en/getting-started/quick-start.md
简体中文 docs/zh/getting-started/quick-start.md
Configuration docs/en/operations/configuration.md
Monitoring docs/en/operations/monitoring.md
Troubleshooting docs/en/getting-started/troubleshooting.md

API docs are available at http://localhost:8000/docs and http://localhost:8000/redoc once the stack is running.

Roadmap

eval_752 is under active development. Planned work includes:

  • Arena Mode (pairwise + Elo rankings)
  • Robustness and anti-prefabrication signals
  • Browser Harness capture/import flows
  • LLM fingerprinting and model identity verification

Contributing

Read CONTRIBUTING.md before opening a pull request.

License

GNU AGPL v3.0
