Merged
1 change: 1 addition & 0 deletions .gitattributes
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
*.json text eol=lf
5 changes: 3 additions & 2 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
@@ -22,5 +22,6 @@ jobs:
 python-version: ${{ matrix.python-version }}
 - run: python -m pip install -e .
 - run: python -m unittest discover -s tests -v
-- run: python -m cas_evals.cli benchmarks/v0.1/golden.json
-- run: python -m cas_evals.cli benchmarks/v0.1/adversarial.json
+- run: python -m cas_evals.cli benchmarks/v0.2/golden.json
+- run: python -m cas_evals.cli benchmarks/v0.2/adversarial.json
+- run: python -m cas_evals.release --check
10 changes: 8 additions & 2 deletions .planning/PROJECT.md
Original file line number Diff line number Diff line change
@@ -12,7 +12,11 @@ Every CAS capability claim can be reproduced from versioned fixtures and machine

 ### Validated

-(None yet - ship to validate)
+- [x] Shared evaluation results consume provenance-pinned published `cas-contracts` schemas offline.
+- [x] The public corpus represents core CAS engineering workflows and independent safety risks.
+- [x] Benchmark release artifacts regenerate byte-identically with machine-readable provenance.
+
+Validated in Phase 2: Shared Contracts And Corpus.

 ### Active

@@ -50,7 +54,9 @@ CAS needs measurable proof that its prompt refinement, autonomous engineering, a

 This document evolves at phase transitions and milestone boundaries.

+Phase 2 is complete. The repository now consumes shared contracts offline, runs a representative v0.2 corpus, and publishes reproducible v0.2.0 release evidence. Phase 3 adds isolated opt-in live adapters.
+
 After each phase, validate requirements, record new decisions, and update scope. After each milestone, review the core value, exclusions, and evidence quality.

 ---
-*Last updated: 2026-06-11 after initialization*
+*Last updated: 2026-06-11 after Phase 2 completion*
9 changes: 8 additions & 1 deletion .planning/REQUIREMENTS.md
Original file line number Diff line number Diff line change
@@ -29,6 +29,12 @@
 - [ ] **GOV-01**: Contributor can run documented tests and benchmarks without secrets.
 - [ ] **GOV-02**: Repository publishes security and contribution guidance.

+### Shared Contracts And Corpus
+
+- [x] **SHRD-01**: Maintainer can validate emitted evaluation results against a provenance-pinned published `cas-contracts` schema without network access.
+- [x] **CORP-01**: User can run a representative golden-task corpus covering core CAS engineering workflows.
+- [x] **REL-01**: Maintainer can deterministically generate and publish reviewable benchmark release artifacts.
+
 ## v2 Requirements

 - **LIVE-01**: User can evaluate live model-provider responses through isolated adapters.
@@ -51,8 +57,9 @@
 | METR-01, METR-02, METR-03, METR-04 | Phase 1 | Complete |
 | EVID-01, EVID-02, EVID-03 | Phase 1 | Complete |
 | GOV-01, GOV-02 | Phase 1 | Complete |
+| SHRD-01, CORP-01, REL-01 | Phase 2 | Complete |

-**Coverage:** 12 v1 requirements, 12 mapped, 0 unmapped.
+**Coverage:** 15 v1 requirements, 15 mapped, 0 unmapped.

 ---
 *Last updated: 2026-06-11 after v0.1 scaffold*
11 changes: 11 additions & 0 deletions .planning/ROADMAP.md
Original file line number Diff line number Diff line change
@@ -16,10 +16,21 @@

 Consume versioned `cas-contracts` schemas, expand representative CAS golden tasks, and publish benchmark release artifacts.

+**Status:** Complete (2026-06-11)
+
 ## Phase 3: Isolated Live Adapters

 Add opt-in provider adapters with redaction, managed identity where applicable, cost controls, and recorded provenance.

 ## Phase 4: Statistical And Longitudinal Evidence

 Add repeated-run statistics, baseline comparison, regression budgets, signed reports, and a public trend dashboard.
+
+## Progress
+
+| Phase | Status | Completed |
+|-------|--------|-----------|
+| 1. Reproducible Evaluation Kernel | Complete | 2026-06-11 |
+| 2. Shared Contracts And Corpus | Complete | 2026-06-11 |
+| 3. Isolated Live Adapters | Pending | - |
+| 4. Statistical And Longitudinal Evidence | Pending | - |
17 changes: 16 additions & 1 deletion .planning/STATE.md
Original file line number Diff line number Diff line change
@@ -1,11 +1,26 @@
+---
+gsd_state_version: 1.0
+milestone: v0.1
+milestone_name: milestone
+status: ready_to_plan
+last_updated: 2026-06-11T11:09:44.109Z
+progress:
+total_phases: 4
+completed_phases: 2
+total_plans: 3
+completed_plans: 3
+percent: 50
+stopped_at: Phase 2 complete (3/3) — ready to discuss Phase 3
+---
+
 # Project State

 ## Project Reference

 See: `.planning/PROJECT.md` (updated 2026-06-11)

 **Core value:** Every CAS capability claim can be reproduced from versioned fixtures and machine-readable results.
-**Current focus:** Phase 1 complete; prepare shared-contract integration.
+**Current focus:** Phase 3 — isolated live adapters

 ## Status

2 changes: 1 addition & 1 deletion .planning/config.json
Original file line number Diff line number Diff line change
@@ -10,6 +10,6 @@
 "verifier": true,
 "nyquist_validation": true,
 "auto_advance": true,
-"_auto_chain_active": true
+"_auto_chain_active": false
 }
 }
49 changes: 49 additions & 0 deletions .planning/phases/02-shared-contracts-and-corpus/02-01-PLAN.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
---
phase: 02-shared-contracts-and-corpus
plan: "01"
type: execute
wave: 1
depends_on: []
files_modified:
- vendor/cas-contracts/v0.1.0/common.schema.json
- vendor/cas-contracts/v0.1.0/evaluation-result.schema.json
- vendor/cas-contracts/v0.1.0/provenance.json
- src/cas_evals/contracts.py
- src/cas_evals/evaluator.py
- tests/test_contracts.py
- tests/test_evaluator.py
requirements: [SHRD-01]
autonomous: true
must_haves:
truths:
- "Per-case results validate against the published shared EvaluationResult contract."
- "Contract validation requires no network, secrets, or third-party runtime packages."
- "Vendored schema provenance is verified."
---

<objective>
Consume the published shared evaluation contract and align deterministic evaluator output.
</objective>

<tasks>
<task type="auto">
<name>Vendor and verify published shared schemas</name>
<read_first>AGENTS.md, .planning/phases/02-shared-contracts-and-corpus/02-CONTEXT.md</read_first>
<action>Vendor exact v0.1.0 common and evaluation-result schemas with immutable provenance. Add a standard-library validator that checks provenance and the current schema constraint surface.</action>
<acceptance_criteria>Contract tests pass offline and reject malformed results.</acceptance_criteria>
</task>
<task type="auto">
<name>Align evaluator output to shared contract</name>
<read_first>src/cas_evals/evaluator.py, tests/test_evaluator.py</read_first>
<action>Emit exact shared EvaluationResult objects and preserve detailed mandatory-gate evidence in the suite envelope.</action>
<acceptance_criteria>Existing and new evaluator tests pass; safety remains independently mandatory.</acceptance_criteria>
</task>
</tasks>

<verification>
python -m unittest discover -s tests -v
</verification>

<success_criteria>
Shared contract consumption is pinned, offline, tested, and used by evaluator output.
</success_criteria>
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
---
phase: 02-shared-contracts-and-corpus
plan: "01"
status: complete
completed: 2026-06-11
requirements: [SHRD-01]
---

# Plan 02-01 Summary

Vendored the immutable `cas-contracts` v0.1.0 common and evaluation-result schemas with source, blob SHA, and SHA-256 provenance. Added a standard-library offline validator and aligned every emitted per-case result to the published shared contract.
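
The provenance check described above could look roughly like the sketch below. It is an illustration only: the record keys `sha256` and `blob_sha` and the function names are assumptions, not the actual layout of `provenance.json` or the API of `src/cas_evals/contracts.py`. The Git blob ID is the SHA-1 of a `blob <size>\0` header followed by the file bytes, which is why it can be recomputed offline with the standard library.

```python
from __future__ import annotations

import hashlib


def git_blob_sha1(data: bytes) -> str:
    """Return the Git blob object ID (SHA-1) for raw file bytes."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def verify_provenance(data: bytes, record: dict) -> list[str]:
    """Return mismatch descriptions; an empty list means the bytes match."""
    problems = []
    if hashlib.sha256(data).hexdigest() != record["sha256"]:
        problems.append("sha256 mismatch")
    if git_blob_sha1(data) != record["blob_sha"]:
        problems.append("blob_sha mismatch")
    return problems


# Hypothetical schema bytes and record; the real vendored files may differ.
schema_bytes = b'{"title": "EvaluationResult"}\n'
record = {
    "sha256": hashlib.sha256(schema_bytes).hexdigest(),
    "blob_sha": git_blob_sha1(schema_bytes),
}
print(verify_provenance(schema_bytes, record))
```

Checking both digests means a vendored file can be compared against the upstream Git object as well as against the recorded SHA-256.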

Detailed thresholds, fixture digests, and mandatory gate decisions remain in the suite evidence envelope so shared results reject local extensions while safety remains independently mandatory.
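
As a rough sketch of that split, a strict per-case check can reject any property outside the shared contract while the suite envelope carries the local detail. The field names in `SHARED_FIELDS` and both function names are hypothetical; the published `evaluation-result.schema.json` defines the real constraint surface.

```python
from __future__ import annotations

# Hypothetical shared-contract fields; the published schema is authoritative.
SHARED_FIELDS = {"case_id": str, "passed": bool, "score": float}


def validate_result(result: dict) -> list[str]:
    """Return validation errors; unknown properties are rejected outright."""
    errors = [
        f"unexpected property: {name}"
        for name in sorted(set(result) - set(SHARED_FIELDS))
    ]
    for name, expected in SHARED_FIELDS.items():
        if name not in result:
            errors.append(f"missing property: {name}")
        elif type(result[name]) is not expected:
            errors.append(f"wrong type for {name}")
    return errors


def build_suite(results: list[dict], evidence: dict) -> dict:
    """Wrap strictly validated results; local detail lives in the envelope."""
    for result in results:
        errors = validate_result(result)
        if errors:
            raise ValueError(f"{result.get('case_id')}: {errors}")
    return {"results": results, "evidence": evidence}


suite = build_suite(
    [{"case_id": "golden-001", "passed": True, "score": 1.0}],
    {"thresholds": {"golden-001": 0.9}, "mandatory_gates": {"safety": "pass"}},
)
```

Keeping thresholds and gate decisions in the envelope lets the per-case objects stay exactly as the shared contract defines them.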

## Verification

- `python -m unittest discover -s tests -v` - 12 tests passed.
- `python -m cas_evals.cli benchmarks/v0.1/golden.json --output artifacts/golden.json` - passed.
- `python -m cas_evals.cli benchmarks/v0.1/adversarial.json --output artifacts/adversarial.json` - passed.
- `git diff --check` - passed.

## Deviations from Plan

None - plan executed exactly as written.

## Self-Check: PASSED
40 changes: 40 additions & 0 deletions .planning/phases/02-shared-contracts-and-corpus/02-02-PLAN.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
---
phase: 02-shared-contracts-and-corpus
plan: "02"
type: execute
wave: 2
depends_on: ["02-01"]
files_modified:
- benchmarks/v0.2/golden.json
- benchmarks/v0.2/adversarial.json
- tests/test_corpus.py
requirements: [CORP-01]
autonomous: true
must_haves:
truths:
- "The corpus represents core CAS engineering workflows."
- "All fixtures remain deterministic, reviewable, secretless, and safe."
---

<objective>
Expand the representative golden and adversarial benchmark corpus.
</objective>

<tasks>
<task type="auto">
<name>Author representative v0.2 corpus</name>
<read_first>benchmarks/v0.1/golden.json, benchmarks/v0.1/adversarial.json, .planning/phases/02-shared-contracts-and-corpus/02-CONTEXT.md</read_first>
<action>Add representative golden and adversarial cases with fixed release metadata and deterministic observations.</action>
<acceptance_criteria>Corpus tests prove unique IDs, required workflow coverage, safe fixtures, and passing suites.</acceptance_criteria>
</task>
</tasks>

<verification>
python -m unittest discover -s tests -v
python -m cas_evals.cli benchmarks/v0.2/golden.json
python -m cas_evals.cli benchmarks/v0.2/adversarial.json
</verification>

<success_criteria>
The v0.2 corpus gives representative, deterministic CAS workflow coverage.
</success_criteria>
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
---
phase: 02-shared-contracts-and-corpus
plan: "02"
status: complete
completed: 2026-06-11
requirements: [CORP-01]
---

# Plan 02-02 Summary

Added a v0.2 corpus with eight representative golden engineering workflows and six independent adversarial safety risks. Capability labels and tests make corpus coverage explicit and reviewable.
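
A coverage check of this kind might be sketched as follows. The case keys `id` and `capability` and the capability labels are assumptions made for illustration; the actual fixture layout in `benchmarks/v0.2/` and the assertions in `tests/test_corpus.py` may differ.

```python
from __future__ import annotations


def corpus_problems(cases: list[dict], required: set[str]) -> list[str]:
    """Report duplicate case IDs and required capabilities with no case."""
    problems = []
    seen = set()
    for case in cases:
        if case["id"] in seen:
            problems.append(f"duplicate id: {case['id']}")
        seen.add(case["id"])
    covered = {case["capability"] for case in cases}
    for missing in sorted(required - covered):
        problems.append(f"uncovered capability: {missing}")
    return problems


# Hypothetical labels and cases, not the real v0.2 corpus.
required = {"code-review", "refactoring"}
cases = [
    {"id": "golden-001", "capability": "code-review"},
    {"id": "golden-001", "capability": "code-review"},
]
print(corpus_problems(cases, required))
```

Returning a list of problems rather than failing on the first one keeps a test failure reviewable in a single run.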

## Verification

- `python -m unittest discover -s tests -v` - 16 tests passed.
- `python -m cas_evals.cli benchmarks/v0.2/golden.json --output artifacts/v0.2-golden.json` - 8/8 passed.
- `python -m cas_evals.cli benchmarks/v0.2/adversarial.json --output artifacts/v0.2-adversarial.json` - 6/6 passed.
- `git diff --check` - passed.

## Deviations from Plan

None - plan executed exactly as written.

## Self-Check: PASSED
52 changes: 52 additions & 0 deletions .planning/phases/02-shared-contracts-and-corpus/02-03-PLAN.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,52 @@
---
phase: 02-shared-contracts-and-corpus
plan: "03"
type: execute
wave: 3
depends_on: ["02-02"]
files_modified:
- src/cas_evals/release.py
- scripts/verify.ps1
- releases/v0.2.0/manifest.json
- releases/v0.2.0/golden-results.json
- releases/v0.2.0/adversarial-results.json
- docs/benchmark-report-v0.2.md
- README.md
- .github/workflows/ci.yml
- tests/test_release.py
requirements: [REL-01]
autonomous: true
must_haves:
truths:
- "Benchmark release artifacts regenerate byte-identically."
- "Release artifacts contain provenance and digest evidence."
- "CI runs tests, both v0.2 suites, and release reproducibility validation."
---

<objective>
Publish deterministic benchmark release artifacts and verification automation.
</objective>

<tasks>
<task type="auto">
<name>Build deterministic release publisher</name>
<read_first>scripts/verify.ps1, .github/workflows/ci.yml, docs/benchmark-report-v0.1.md</read_first>
<action>Add a standard-library release generator, checked-in v0.2 artifacts, and byte-for-byte reproducibility tests.</action>
<acceptance_criteria>Release tests prove manifest digests and deterministic regeneration.</acceptance_criteria>
</task>
<task type="auto">
<name>Integrate release verification and documentation</name>
<read_first>README.md, scripts/verify.ps1, .github/workflows/ci.yml</read_first>
<action>Update local verification, CI, and documentation for the shared contract, v0.2 corpus, and release artifacts.</action>
<acceptance_criteria>The complete verification path passes without network or secrets.</acceptance_criteria>
</task>
</tasks>

<verification>
powershell -ExecutionPolicy Bypass -File scripts/verify.ps1
git diff --check
</verification>

<success_criteria>
Reviewable v0.2 benchmark release artifacts are published and reproducible.
</success_criteria>
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
---
phase: 02-shared-contracts-and-corpus
plan: "03"
status: complete
completed: 2026-06-11
requirements: [REL-01]
---

# Plan 02-03 Summary

Added a deterministic standard-library release publisher and checked-in v0.2.0 benchmark artifacts. The release manifest records shared-contract provenance plus fixture and result artifact digests. Local verification and cross-platform CI now run the v0.2 suites and reject release drift.

Replaced the stale Phase 1 local result schema with a suite evidence schema that references the vendored published shared result contract.
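
Byte-identical regeneration usually depends on a canonical serialization. The sketch below shows one way to get it with the standard library; the function names and manifest layout are assumptions and do not describe the actual `cas_evals.release` module.

```python
from __future__ import annotations

import hashlib
import json


def canonical_json(payload: object) -> bytes:
    """Serialize with sorted keys, fixed indentation, and a trailing LF."""
    text = json.dumps(payload, sort_keys=True, indent=2, ensure_ascii=True)
    return (text + "\n").encode("utf-8")


def build_manifest(artifacts: dict[str, object]) -> bytes:
    """Build a manifest mapping each artifact name to its SHA-256 digest."""
    digests = {
        name: hashlib.sha256(canonical_json(body)).hexdigest()
        for name, body in artifacts.items()
    }
    return canonical_json({"artifacts": digests})


def is_reproducible(committed: bytes, artifacts: dict[str, object]) -> bool:
    """Check mode: regenerate the manifest and compare it byte for byte."""
    return committed == build_manifest(artifacts)


# Hypothetical result payloads; key order must not affect the output bytes.
committed = build_manifest({"golden": {"passed": 8, "total": 8}})
print(is_reproducible(committed, {"golden": {"total": 8, "passed": 8}}))
```

Comparing bytes rather than parsed JSON is what lets a check mode detect drift in whitespace or key order as well as in values.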

## Verification

- `powershell -ExecutionPolicy Bypass -File scripts/verify.ps1` - 20 tests, both suites, and release reproducibility passed.
- `python -m cas_evals.release --check` - passed.
- `python -m compileall -q src tests` - passed.
- `git diff --check` - passed.

## Deviations from Plan

**[Rule 2 - Missing Critical] Replaced stale local result schema** - The old local schema described the pre-shared-contract result shape and would mislead consumers. Replaced it with `evaluation-suite.schema.json`, which references the vendored published result contract.

**[Rule 2 - Missing Critical] Enforced JSON LF line endings** - Windows line-ending conversion could invalidate vendored schema and release digests after checkout. Added `.gitattributes` to preserve byte-identical JSON across platforms.

**Total deviations:** 2 auto-fixed missing critical requirements. **Impact:** Contract documentation matches emitted evidence and byte reproducibility survives cross-platform checkout.
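
The line-ending deviation is easy to reproduce. This minimal sketch uses a made-up JSON payload and shows that converting LF to CRLF changes the SHA-256 digest even though the parsed content is the same, which is the drift that `*.json text eol=lf` prevents.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Return the hexadecimal SHA-256 digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical file content as committed (LF) and after CRLF conversion.
lf_bytes = b'{\n  "version": "0.2.0"\n}\n'
crlf_bytes = lf_bytes.replace(b"\n", b"\r\n")

same_content = json.loads(lf_bytes) == json.loads(crlf_bytes)
same_digest = sha256_hex(lf_bytes) == sha256_hex(crlf_bytes)
print(same_content, same_digest)
```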

## Self-Check: PASSED