Coding-Autopilot-System · OgeonX-Ai · Jun 11, 2026
@@ -0,0 +1 @@
+*.json text eol=lf
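This attribute pins JSON files to LF line endings on checkout, which matters because the vendored schema and release digests are computed over raw bytes. A minimal sketch of the failure it prevents, assuming SHA-256 digests as named in the Plan 02-01 summary; the JSON payload is hypothetical and not a file from this repository:

```python
import hashlib

# Hypothetical JSON payload; not a file from this repository.
lf_bytes = b'{\n  "schema_version": "0.1.0"\n}\n'

# What a checkout with automatic CRLF conversion would leave on disk.
crlf_bytes = lf_bytes.replace(b"\n", b"\r\n")

lf_digest = hashlib.sha256(lf_bytes).hexdigest()
crlf_digest = hashlib.sha256(crlf_bytes).hexdigest()

# Same logical JSON, different bytes, therefore different digests.
digests_differ = lf_digest != crlf_digest
print("digests differ:", digests_differ)
```
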
@@ -22,5 +22,6 @@ jobs:
           python-version: ${{ matrix.python-version }}
       - run: python -m pip install -e .
       - run: python -m unittest discover -s tests -v
-      - run: python -m cas_evals.cli benchmarks/v0.1/golden.json
-      - run: python -m cas_evals.cli benchmarks/v0.1/adversarial.json
+      - run: python -m cas_evals.cli benchmarks/v0.2/golden.json
+      - run: python -m cas_evals.cli benchmarks/v0.2/adversarial.json
+      - run: python -m cas_evals.release --check
@@ -12,7 +12,11 @@ Every CAS capability claim can be reproduced from versioned fixtures and machine
 
 ### Validated
 
-(None yet - ship to validate)
+- [x] Shared evaluation results validate offline against provenance-pinned, published `cas-contracts` schemas.
+- [x] The public corpus represents core CAS engineering workflows and independent safety risks.
+- [x] Benchmark release artifacts regenerate byte-identically with machine-readable provenance.
+
+Validated in Phase 2: Shared Contracts And Corpus.
 
 ### Active
 
@@ -50,7 +54,9 @@ CAS needs measurable proof that its prompt refinement, autonomous engineering, a
 
 This document evolves at phase transitions and milestone boundaries.
 
+Phase 2 is complete. The repository now consumes shared contracts offline, runs a representative v0.2 corpus, and publishes reproducible v0.2.0 release evidence. Phase 3 adds isolated opt-in live adapters.
+
 After each phase, validate requirements, record new decisions, and update scope. After each milestone, review the core value, exclusions, and evidence quality.
 
 ---
-*Last updated: 2026-06-11 after initialization*
+*Last updated: 2026-06-11 after Phase 2 completion*
@@ -29,6 +29,12 @@
 - [ ] **GOV-01**: Contributor can run documented tests and benchmarks without secrets.
 - [ ] **GOV-02**: Repository publishes security and contribution guidance.
 
+### Shared Contracts And Corpus
+
+- [x] **SHRD-01**: Maintainer can validate emitted evaluation results against a provenance-pinned published `cas-contracts` schema without network access.
+- [x] **CORP-01**: User can run a representative golden-task corpus covering core CAS engineering workflows.
+- [x] **REL-01**: Maintainer can deterministically generate and publish reviewable benchmark release artifacts.
+
 ## v2 Requirements
 
 - **LIVE-01**: User can evaluate live model-provider responses through isolated adapters.
@@ -51,8 +57,9 @@
 | METR-01, METR-02, METR-03, METR-04 | Phase 1 | Complete |
 | EVID-01, EVID-02, EVID-03 | Phase 1 | Complete |
 | GOV-01, GOV-02 | Phase 1 | Complete |
+| SHRD-01, CORP-01, REL-01 | Phase 2 | Complete |
 
-**Coverage:** 12 v1 requirements, 12 mapped, 0 unmapped.
+**Coverage:** 15 v1 requirements, 15 mapped, 0 unmapped.
 
 ---
 *Last updated: 2026-06-11 after v0.1 scaffold*
@@ -16,10 +16,21 @@
 
 Consume versioned `cas-contracts` schemas, expand representative CAS golden tasks, and publish benchmark release artifacts.
 
+**Status:** Complete (2026-06-11)
+
 ## Phase 3: Isolated Live Adapters
 
 Add opt-in provider adapters with redaction, managed identity where applicable, cost controls, and recorded provenance.
 
 ## Phase 4: Statistical And Longitudinal Evidence
 
 Add repeated-run statistics, baseline comparison, regression budgets, signed reports, and a public trend dashboard.
+
+## Progress
+
+| Phase | Status | Completed |
+|-------|--------|-----------|
+| 1. Reproducible Evaluation Kernel | Complete | 2026-06-11 |
+| 2. Shared Contracts And Corpus | Complete | 2026-06-11 |
+| 3. Isolated Live Adapters | Pending | - |
+| 4. Statistical And Longitudinal Evidence | Pending | - |
@@ -1,11 +1,26 @@
+---
+gsd_state_version: 1.0
+milestone: v0.1
+milestone_name: milestone
+status: ready_to_plan
+last_updated: 2026-06-11T11:09:44.109Z
+progress:
+  total_phases: 4
+  completed_phases: 2
+  total_plans: 3
+  completed_plans: 3
+  percent: 50
+stopped_at: Phase 2 complete (3/3) — ready to discuss Phase 3
+---
+
 # Project State
 
 ## Project Reference
 
 See: `.planning/PROJECT.md` (updated 2026-06-11)
 
 **Core value:** Every CAS capability claim can be reproduced from versioned fixtures and machine-readable results.
-**Current focus:** Phase 1 complete; prepare shared-contract integration.
+**Current focus:** Phase 3 — isolated live adapters
 
 ## Status
 

@@ -10,6 +10,6 @@
     "verifier": true,
     "nyquist_validation": true,
     "auto_advance": true,
-    "_auto_chain_active": true
+    "_auto_chain_active": false
   }
 }
@@ -0,0 +1,49 @@
+---
+phase: 02-shared-contracts-and-corpus
+plan: "01"
+type: execute
+wave: 1
+depends_on: []
+files_modified:
+  - vendor/cas-contracts/v0.1.0/common.schema.json
+  - vendor/cas-contracts/v0.1.0/evaluation-result.schema.json
+  - vendor/cas-contracts/v0.1.0/provenance.json
+  - src/cas_evals/contracts.py
+  - src/cas_evals/evaluator.py
+  - tests/test_contracts.py
+  - tests/test_evaluator.py
+requirements: [SHRD-01]
+autonomous: true
+must_haves:
+  truths:
+    - "Per-case results validate against the published shared EvaluationResult contract."
+    - "Contract validation requires no network, secrets, or third-party runtime packages."
+    - "Vendored schema provenance is verified."
+---
+
+<objective>
+Consume the published shared evaluation contract and align deterministic evaluator output.
+</objective>
+
+<tasks>
+<task type="auto">
+  <name>Vendor and verify published shared schemas</name>
+  <read_first>AGENTS.md, .planning/phases/02-shared-contracts-and-corpus/02-CONTEXT.md</read_first>
+  <action>Vendor exact v0.1.0 common and evaluation-result schemas with immutable provenance. Add a standard-library validator that checks provenance and the current schema constraint surface.</action>
+  <acceptance_criteria>Contract tests pass offline and reject malformed results.</acceptance_criteria>
+</task>
+<task type="auto">
+  <name>Align evaluator output to shared contract</name>
+  <read_first>src/cas_evals/evaluator.py, tests/test_evaluator.py</read_first>
+  <action>Emit exact shared EvaluationResult objects and preserve detailed mandatory-gate evidence in the suite envelope.</action>
+  <acceptance_criteria>Existing and new evaluator tests pass; safety remains independently mandatory.</acceptance_criteria>
+</task>
+</tasks>
+
+<verification>
+python -m unittest discover -s tests -v
+</verification>
+
+<success_criteria>
+Shared contract consumption is pinned, offline, tested, and used by evaluator output.
+</success_criteria>
@@ -0,0 +1,26 @@
+---
+phase: 02-shared-contracts-and-corpus
+plan: "01"
+status: complete
+completed: 2026-06-11
+requirements: [SHRD-01]
+---
+
+# Plan 02-01 Summary
+
+Vendored the immutable `cas-contracts` v0.1.0 common and evaluation-result schemas with source, blob SHA, and SHA-256 provenance. Added a standard-library offline validator and aligned every emitted per-case result to the published shared contract.
+
+Detailed thresholds, fixture digests, and mandatory-gate decisions remain in the suite evidence envelope. Per-case results therefore carry no local extensions, which the shared contract rejects, and safety remains independently mandatory.
+
+## Verification
+
+- `python -m unittest discover -s tests -v` - 12 tests passed.
+- `python -m cas_evals.cli benchmarks/v0.1/golden.json --output artifacts/golden.json` - passed.
+- `python -m cas_evals.cli benchmarks/v0.1/adversarial.json --output artifacts/adversarial.json` - passed.
+- `git diff --check` - passed.
+
+## Deviations from Plan
+
+None - plan executed exactly as written.
+
+## Self-Check: PASSED
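The offline provenance check described in this summary could look roughly like the following. This is a minimal standard-library sketch, not the code in `src/cas_evals/contracts.py`: the function name `verify_vendored_schemas` and the `provenance.json` layout (`{"files": {name: {"sha256": hex}}}`) are assumptions made for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def verify_vendored_schemas(vendor_dir: Path) -> list[str]:
    """Return names of vendored files whose SHA-256 differs from provenance."""
    # Assumed layout: {"files": {"<name>": {"sha256": "<hex digest>"}}}
    provenance = json.loads(
        (vendor_dir / "provenance.json").read_text(encoding="utf-8")
    )
    mismatched = []
    for name, record in sorted(provenance["files"].items()):
        actual = hashlib.sha256((vendor_dir / name).read_bytes()).hexdigest()
        if actual != record["sha256"]:
            mismatched.append(name)
    return mismatched


# Demo in a throwaway directory standing in for the vendored schema folder.
with tempfile.TemporaryDirectory() as tmp:
    vendor = Path(tmp)
    schema_bytes = b'{"type": "object"}\n'
    (vendor / "common.schema.json").write_bytes(schema_bytes)
    provenance = {
        "files": {
            "common.schema.json": {
                "sha256": hashlib.sha256(schema_bytes).hexdigest()
            }
        }
    }
    (vendor / "provenance.json").write_text(
        json.dumps(provenance), encoding="utf-8"
    )

    clean_result = verify_vendored_schemas(vendor)

    # Simulate line-ending drift after checkout.
    (vendor / "common.schema.json").write_bytes(
        schema_bytes.replace(b"\n", b"\r\n")
    )
    drift_result = verify_vendored_schemas(vendor)
```

Reading bytes rather than text keeps the check sensitive to any change on disk, including line-ending conversion, and nothing here touches the network.
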
@@ -0,0 +1,40 @@
+---
+phase: 02-shared-contracts-and-corpus
+plan: "02"
+type: execute
+wave: 2
+depends_on: ["02-01"]
+files_modified:
+  - benchmarks/v0.2/golden.json
+  - benchmarks/v0.2/adversarial.json
+  - tests/test_corpus.py
+requirements: [CORP-01]
+autonomous: true
+must_haves:
+  truths:
+    - "The corpus represents core CAS engineering workflows."
+    - "All fixtures remain deterministic, reviewable, secretless, and safe."
+---
+
+<objective>
+Expand the representative golden and adversarial benchmark corpus.
+</objective>
+
+<tasks>
+<task type="auto">
+  <name>Author representative v0.2 corpus</name>
+  <read_first>benchmarks/v0.1/golden.json, benchmarks/v0.1/adversarial.json, .planning/phases/02-shared-contracts-and-corpus/02-CONTEXT.md</read_first>
+  <action>Add representative golden and adversarial cases with fixed release metadata and deterministic observations.</action>
+  <acceptance_criteria>Corpus tests prove unique IDs, required workflow coverage, safe fixtures, and passing suites.</acceptance_criteria>
+</task>
+</tasks>
+
+<verification>
+python -m unittest discover -s tests -v
+python -m cas_evals.cli benchmarks/v0.2/golden.json
+python -m cas_evals.cli benchmarks/v0.2/adversarial.json
+</verification>
+
+<success_criteria>
+The v0.2 corpus gives representative, deterministic CAS workflow coverage.
+</success_criteria>
@@ -0,0 +1,24 @@
+---
+phase: 02-shared-contracts-and-corpus
+plan: "02"
+status: complete
+completed: 2026-06-11
+requirements: [CORP-01]
+---
+
+# Plan 02-02 Summary
+
+Added a v0.2 corpus with eight representative golden engineering workflows and six independent adversarial safety risks. Capability labels and tests make corpus coverage explicit and reviewable.
+
+## Verification
+
+- `python -m unittest discover -s tests -v` - 16 tests passed.
+- `python -m cas_evals.cli benchmarks/v0.2/golden.json --output artifacts/v0.2-golden.json` - 8/8 passed.
+- `python -m cas_evals.cli benchmarks/v0.2/adversarial.json --output artifacts/v0.2-adversarial.json` - 6/6 passed.
+- `git diff --check` - passed.
+
+## Deviations from Plan
+
+None - plan executed exactly as written.
+
+## Self-Check: PASSED
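The corpus properties named in the plan (unique IDs and required workflow coverage) lend themselves to a small structural check. The sketch below is illustrative only: the `id` and `capability` field names, the sample labels, and the `check_corpus` helper are hypothetical and may not match `tests/test_corpus.py` or the v0.2 fixture format.

```python
from collections import Counter


def check_corpus(cases, required_capabilities):
    """Return human-readable problems; an empty list means the corpus passes."""
    problems = []

    # Every case ID must appear exactly once.
    id_counts = Counter(case["id"] for case in cases)
    for case_id, count in sorted(id_counts.items()):
        if count > 1:
            problems.append(f"duplicate id: {case_id}")

    # Every required capability label must be covered by at least one case.
    covered = {case["capability"] for case in cases}
    for capability in sorted(required_capabilities - covered):
        problems.append(f"missing capability: {capability}")

    return problems


# Hypothetical cases; IDs and capability labels are made up for illustration.
sample_cases = [
    {"id": "golden-001", "capability": "bug-fix"},
    {"id": "golden-002", "capability": "refactor"},
    {"id": "golden-002", "capability": "refactor"},
]
problems = check_corpus(sample_cases, {"bug-fix", "refactor", "code-review"})
```
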
@@ -0,0 +1,52 @@
+---
+phase: 02-shared-contracts-and-corpus
+plan: "03"
+type: execute
+wave: 3
+depends_on: ["02-02"]
+files_modified:
+  - src/cas_evals/release.py
+  - scripts/verify.ps1
+  - releases/v0.2.0/manifest.json
+  - releases/v0.2.0/golden-results.json
+  - releases/v0.2.0/adversarial-results.json
+  - docs/benchmark-report-v0.2.md
+  - README.md
+  - .github/workflows/ci.yml
+  - tests/test_release.py
+requirements: [REL-01]
+autonomous: true
+must_haves:
+  truths:
+    - "Benchmark release artifacts regenerate byte-identically."
+    - "Release artifacts contain provenance and digest evidence."
+    - "CI runs tests, both v0.2 suites, and release reproducibility validation."
+---
+
+<objective>
+Publish deterministic benchmark release artifacts and verification automation.
+</objective>
+
+<tasks>
+<task type="auto">
+  <name>Build deterministic release publisher</name>
+  <read_first>scripts/verify.ps1, .github/workflows/ci.yml, docs/benchmark-report-v0.1.md</read_first>
+  <action>Add a standard-library release generator, checked-in v0.2 artifacts, and byte-for-byte reproducibility tests.</action>
+  <acceptance_criteria>Release tests prove manifest digests and deterministic regeneration.</acceptance_criteria>
+</task>
+<task type="auto">
+  <name>Integrate release verification and documentation</name>
+  <read_first>README.md, scripts/verify.ps1, .github/workflows/ci.yml</read_first>
+  <action>Update local verification, CI, and documentation for the shared contract, v0.2 corpus, and release artifacts.</action>
+  <acceptance_criteria>The complete verification path passes without network or secrets.</acceptance_criteria>
+</task>
+</tasks>
+
+<verification>
+powershell -ExecutionPolicy Bypass -File scripts/verify.ps1
+git diff --check
+</verification>
+
+<success_criteria>
+Reviewable v0.2 benchmark release artifacts are published and reproducible.
+</success_criteria>
@@ -0,0 +1,30 @@
+---
+phase: 02-shared-contracts-and-corpus
+plan: "03"
+status: complete
+completed: 2026-06-11
+requirements: [REL-01]
+---
+
+# Plan 02-03 Summary
+
+Added a deterministic standard-library release publisher and checked-in v0.2.0 benchmark artifacts. The release manifest records shared-contract provenance plus fixture and result artifact digests. Local verification and cross-platform CI now run the v0.2 suites and reject release drift.
+
+Replaced the stale Phase 1 local result schema with a suite evidence schema that references the vendored published shared result contract.
+
+## Verification
+
+- `powershell -ExecutionPolicy Bypass -File scripts/verify.ps1` - 20 tests, both suites, and release reproducibility passed.
+- `python -m cas_evals.release --check` - passed.
+- `python -m compileall -q src tests` - passed.
+- `git diff --check` - passed.
+
+## Deviations from Plan
+
+**[Rule 2 - Missing Critical] Replaced stale local result schema** - The old local schema described the pre-shared-contract result shape and would mislead consumers. Replaced it with `evaluation-suite.schema.json`, which references the vendored published result contract.
+
+**[Rule 2 - Missing Critical] Enforced JSON LF line endings** - Windows line-ending conversion could invalidate vendored schema and release digests after checkout. Added `.gitattributes` to preserve byte-identical JSON across platforms.
+
+**Total deviations:** 2 auto-fixed missing critical requirements. **Impact:** Contract documentation matches emitted evidence and byte reproducibility survives cross-platform checkout.
+
+## Self-Check: PASSED
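Byte-identical regeneration depends on a canonical serialization and a byte-level comparison against the checked-in artifacts. The following is a sketch under stated assumptions, not the implementation in `src/cas_evals/release.py`: the helper names (`render_artifact`, `build_manifest`, `check_release`) and the result field names are hypothetical, and only the standard library is used.

```python
import hashlib
import json


def render_artifact(payload: dict) -> bytes:
    """Serialize canonically: sorted keys, fixed indent, LF newline, UTF-8."""
    text = json.dumps(payload, sort_keys=True, indent=2, ensure_ascii=False)
    return (text + "\n").encode("utf-8")


def build_manifest(artifacts: dict) -> dict:
    """Map each artifact name to the SHA-256 digest of its bytes."""
    return {
        name: hashlib.sha256(data).hexdigest()
        for name, data in sorted(artifacts.items())
    }


def check_release(checked_in: dict, regenerated: dict) -> list:
    """Return names whose regenerated bytes differ from the checked-in bytes."""
    names = sorted(set(checked_in) | set(regenerated))
    return [n for n in names if checked_in.get(n) != regenerated.get(n)]


# Key order in the input must not affect the output bytes.
first = render_artifact({"suite": "golden", "passed": 8, "total": 8})
second = render_artifact({"total": 8, "passed": 8, "suite": "golden"})
same_bytes = first == second

manifest = build_manifest({"golden-results.json": first})

# A CRLF-converted copy counts as drift even though the JSON is equivalent.
drift = check_release(
    {"golden-results.json": first},
    {"golden-results.json": first.replace(b"\n", b"\r\n")},
)
```

Comparing bytes instead of parsed JSON is the stricter choice: it catches formatting and line-ending drift that a structural comparison would miss, which is the same failure mode the `.gitattributes` deviation addresses.
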