Anthropic's Nov 2025 post described an external-state pattern for long-running coding agents. This repo implements a minimal version of that pattern and measures one failure mode it may help constrain: reversion.
A reversion event occurs when a session, while assigned to one feature, modifies a file that feature_to_files.json explicitly maps to a different, already-passing feature.
The task may still complete, so completion metrics alone do not capture this instability.
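As a rough illustration of the definition above, the check for a single session might look like the sketch below. The function name `find_reversion_files`, its parameters, and the assumed shape of feature_to_files.json (feature name mapped to a list of file paths) are hypothetical and not taken from this repo's code.

```python
from typing import Dict, Iterable, List, Set


def find_reversion_files(
    feature_to_files: Dict[str, List[str]],  # assumed schema: feature name -> file paths
    assigned_feature: str,
    passing_features: Set[str],
    modified_files: Iterable[str],
) -> List[str]:
    """Return modified files mapped to an already-passing feature other than the assigned one."""
    protected: Set[str] = set()
    for feature, files in feature_to_files.items():
        # Only files of *other* features that already pass are protected.
        if feature != assigned_feature and feature in passing_features:
            protected.update(files)
    # Assumption: a file shared with the assigned feature still counts,
    # because the definition does not say how shared files are handled.
    return sorted(f for f in modified_files if f in protected)


# Made-up mapping for illustration only.
mapping = {
    "login": ["auth.py"],
    "billing": ["billing.py", "invoice.py"],
    "export": ["export.py"],
}

# The session works on "billing" while "login" already passes.
hits = find_reversion_files(
    mapping, "billing", {"login"}, ["billing.py", "auth.py", "export.py"]
)
print(hits)  # ['auth.py']
```

Here export.py is not flagged because "export" is not yet passing; under this reading, a session with a non-empty result would count as one reversion event.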
- Implements a minimal external-state harness
- Assigns one feature per session with explicit file boundaries
- Measures reversion_rate = events / completed_features
- Compares harness arm vs baseline arm on one synthetic task
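The rate in the list above is a plain ratio, which could be computed along these lines. The function name and the handling of zero completed features (returning `None` instead of a number) are assumptions made for this sketch, not documented behavior of agent.py.

```python
from typing import Optional


def reversion_rate(events: int, completed_features: int) -> Optional[float]:
    """Compute reversion_rate = events / completed_features."""
    if events < 0 or completed_features < 0:
        raise ValueError("counts must be non-negative")
    if completed_features == 0:
        # Assumption: the rate is undefined when nothing completed,
        # so report None rather than a misleading 0.0.
        return None
    return events / completed_features


# Made-up counts for illustration, not results from this repo.
print(reversion_rate(3, 6))  # 0.5
```

Returning `None` for an arm with no completed features keeps an empty run from looking like a run with zero reversions.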
BENCHMARK.md is a placeholder.
Run python3 agent.py --task broken_saas --run both to generate real numbers.
Dry-run shows synthetic mock data only; see the banner in the output.
pip install -r requirements.txt
python3 -m pytest tests/ -v
python3 agent.py --task broken_saas --dry-run

Real runs (requires ANTHROPIC_API_KEY):
python3 agent.py --task broken_saas --run harness
python3 agent.py --task broken_saas --run baseline
python3 agent.py --task broken_saas --run both

Single synthetic task. Hand-authored file mapping. Same model in both arms. See ARCHITECTURE.md.