CAS Evals

Public, reproducible evaluation evidence for the Coding Autopilot System.

CAS Evals runs versioned golden tasks and adversarial prompts against deterministic reference responses. It produces machine-readable quality, safety, cost, and latency evidence without secrets, model-provider accounts, or network access.

Why This Exists

AI engineering claims are weak without reproducible evidence. This repository provides a small evaluation kernel that makes benchmark inputs, thresholds, scoring logic, fixture digests, and pass/fail decisions reviewable.

Quickstart

python -m pip install -e .
python -m unittest discover -s tests -v
python -m cas_evals.cli benchmarks/v0.2/golden.json --output artifacts/golden.json
python -m cas_evals.cli benchmarks/v0.2/adversarial.json --output artifacts/adversarial.json
python -m cas_evals.release --check

The CLI exits non-zero when any mandatory metric fails, making each suite usable as a CI regression gate.
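As a rough illustration of the gating idea, the sketch below wraps suite commands and turns their exit codes into a single pass/fail status. The helper names `run_gate` and `run_suites` are hypothetical and not part of `cas_evals`; the only behavior relied on is the documented one, that the CLI exits non-zero when a mandatory metric fails.

```python
import subprocess
import sys


def run_gate(command: list[str]) -> bool:
    """Run one suite command; True means it exited with status 0."""
    completed = subprocess.run(command, check=False)
    return completed.returncode == 0


def run_suites(commands: list[list[str]]) -> int:
    """Run every command and return 0 only if all of them passed.

    All suites are run even after a failure so that a CI log shows
    every failing gate, not just the first one.
    """
    results = [run_gate(command) for command in commands]
    return 0 if all(results) else 1
```

A CI step could pass the two Quickstart commands (for example `[sys.executable, "-m", "cas_evals.cli", "benchmarks/v0.2/golden.json", "--output", "artifacts/golden.json"]`) to `run_suites` and hand the result to `sys.exit`.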

Windows users can run the complete verification path with .\scripts\verify.ps1. The checked-in v0.2 benchmark report and releases/v0.2.0/ artifacts record the reproducible public baseline.

Metrics

Metric   v0.1 evidence                           Gate
Quality  Fraction of expected concepts present   Configured minimum
Safety   Absence of prohibited unsafe content    Mandatory 100%
Cost     Fixture-supplied normalized USD value   Configured maximum
Latency  Fixture-supplied milliseconds           Configured maximum

Cost and latency are fixture-supplied in v0.1 so results remain deterministic. Future isolated provider adapters will record measured values with explicit provenance.
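The four gates in the table can be sketched in a few lines. This is a minimal illustration, not the evaluator in `src/cas_evals/`: the names `Thresholds`, `quality_score`, `is_safe`, and `evaluate_case` are hypothetical, case-insensitive substring matching is an assumed matching rule, and every check is treated as mandatory for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical per-suite gate configuration."""
    min_quality: float
    max_cost_usd: float
    max_latency_ms: float


def quality_score(response: str, expected_concepts: list[str]) -> float:
    """Fraction of expected concepts found in the response (assumed: substring match)."""
    if not expected_concepts:
        return 1.0
    text = response.lower()
    hits = sum(1 for concept in expected_concepts if concept.lower() in text)
    return hits / len(expected_concepts)


def is_safe(response: str, prohibited: list[str]) -> bool:
    """True when none of the prohibited strings appear in the response."""
    text = response.lower()
    return not any(item.lower() in text for item in prohibited)


def evaluate_case(response: str, expected_concepts: list[str],
                  prohibited: list[str], cost_usd: float,
                  latency_ms: float, thresholds: Thresholds) -> dict:
    """Score one case; cost and latency come from the fixture, not a live call."""
    quality = quality_score(response, expected_concepts)
    checks = {
        "quality": quality >= thresholds.min_quality,
        "safety": is_safe(response, prohibited),
        "cost": cost_usd <= thresholds.max_cost_usd,
        "latency": latency_ms <= thresholds.max_latency_ms,
    }
    return {"quality": quality, "checks": checks, "passed": all(checks.values())}
```

Because cost and latency are read from the fixture rather than measured, the same inputs always produce the same result.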

Evidence Contract

Every per-case result conforms to the published cas-contracts EvaluationResult v0.1.0 schema. The exact tagged shared schemas and immutable provenance are vendored under vendor/cas-contracts/v0.1.0/, so validation remains offline and standalone. Suite evidence adds fixture SHA-256 digests, independent threshold details, and mandatory pass/fail decisions.

See schemas/evaluation-suite.schema.json and vendor/cas-contracts/v0.1.0/provenance.json.
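For readers who want to check a fixture digest by hand, the sketch below computes a SHA-256 digest with the standard library. The function names `fixture_digest` and `digest_matches` are hypothetical, and how the suite evidence actually stores its digests is defined by the schema above, not by this example.

```python
import hashlib
from pathlib import Path


def fixture_digest(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read raw bytes so the digest does not depend on text decoding.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def digest_matches(path: Path, recorded_hex: str) -> bool:
    """Compare a file's digest with a recorded hex string, ignoring case."""
    return fixture_digest(path) == recorded_hex.strip().lower()
```

The digest covers the raw bytes of the file, so any line-ending conversion on checkout changes the result.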

Repository Layout

benchmarks/v0.2/       Representative golden and adversarial fixtures
releases/v0.2.0/       Reproducible benchmark release artifacts
schemas/               Machine-readable suite evidence contract
vendor/cas-contracts/  Pinned published shared contracts
src/cas_evals/         Pure evaluator and CLI
tests/                 Determinism, safety, and CLI contract tests
.planning/             GSD project context, research, requirements, roadmap

Roadmap

Consume shared cas-contracts schemas and expand the public corpus.
Add isolated opt-in live-provider adapters with redaction and cost controls.
Add repeated-run statistics, signed reports, and longitudinal trends.

See SECURITY.md and CONTRIBUTING.md.

