
originaonxi/reversion-study


reversion-study

Anthropic's Nov 2025 post described an external-state pattern for long-running coding agents. This repo implements a minimal version of that pattern and measures one failure mode it may help constrain: reversion.

What reversion means here

A reversion event occurs when a session that is assigned to one feature modifies a file that feature_to_files.json explicitly maps to a different, already-passing feature.
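The definition above can be checked mechanically once you know which features pass, which feature the session was assigned, and which files it touched. The following is a minimal sketch, not the repo's implementation. It assumes feature_to_files.json is a JSON object mapping each feature name to a list of file paths, and it counts one event per (passing feature, file) pair. The function names are hypothetical.

```python
import json


def load_feature_map(path="feature_to_files.json"):
    # Assumed shape (hypothetical): {"feature_name": ["path/a.py", "path/b.py"], ...}
    with open(path) as fh:
        return json.load(fh)


def find_reversion_events(feature_to_files, passing_features, assigned_feature, modified_files):
    """Return (feature, file) pairs where the session touched a file mapped to
    an already-passing feature other than the one it was assigned.

    Assumption: a file shared between the assigned feature and a passing
    feature still counts, following the definition literally.
    """
    modified = set(modified_files)
    events = []
    for feature in sorted(passing_features):
        if feature == assigned_feature:
            continue  # editing the assigned feature's own files is expected
        for path in feature_to_files.get(feature, []):
            if path in modified:
                events.append((feature, path))
    return events


if __name__ == "__main__":
    mapping = {
        "login": ["auth.py"],
        "billing": ["billing.py", "models.py"],
        "search": ["search.py"],
    }
    # Session assigned to "search" also edits models.py, which belongs to passing "billing".
    print(find_reversion_events(mapping, {"login", "billing"}, "search", ["search.py", "models.py"]))
```

Running the example prints `[('billing', 'models.py')]`: the session's edit to search.py is in scope, while its edit to models.py touches a passing feature it was not assigned.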

The task may still complete, so completion metrics alone do not capture this instability.

What this repo does

  1. Implements a minimal external-state harness
  2. Assigns one feature per session with explicit file boundaries
  3. Measures reversion_rate = events / completed_features
  4. Compares harness arm vs baseline arm on one synthetic task
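Steps 2 and 3 can be sketched as follows. This is an illustration under stated assumptions, not the harness in agent.py: it assumes the external state is a list of features, each with a boolean "passes" flag, and that each session is handed the first feature that does not yet pass. `next_feature` and `ArmResult` are hypothetical names. The rate is undefined when no features were completed, so it returns None in that case.

```python
from dataclasses import dataclass
from typing import Optional


def next_feature(features):
    """Hypothetical one-feature-per-session assignment: return the name of the
    first feature whose "passes" flag is False, or None if all pass."""
    for feature in features:
        if not feature["passes"]:
            return feature["name"]
    return None  # nothing left to assign


@dataclass
class ArmResult:
    arm: str                 # e.g. "harness" or "baseline"
    reversion_events: int    # total events observed across the arm's sessions
    completed_features: int  # features that ended up passing

    @property
    def reversion_rate(self) -> Optional[float]:
        # reversion_rate = events / completed_features; undefined with zero completions
        if self.completed_features == 0:
            return None
        return self.reversion_events / self.completed_features


if __name__ == "__main__":
    state = [{"name": "login", "passes": True}, {"name": "billing", "passes": False}]
    print(next_feature(state))  # the session would be assigned "billing"
    # Made-up counts for illustration only, not results from this repo.
    for result in (ArmResult("harness", 1, 4), ArmResult("baseline", 3, 4)):
        print(result.arm, result.reversion_rate)
```

In the made-up example both arms complete 4 features, so a completion metric cannot tell them apart, while their reversion rates (0.25 vs 0.75) differ.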

Results

BENCHMARK.md is a placeholder. Run python3 agent.py --task broken_saas --run both to generate real numbers.

The dry run shows synthetic mock data only; see the banner in its output.

Running

pip install -r requirements.txt
python3 -m pytest tests/ -v
python3 agent.py --task broken_saas --dry-run

Real runs (require ANTHROPIC_API_KEY):

python3 agent.py --task broken_saas --run harness
python3 agent.py --task broken_saas --run baseline
python3 agent.py --task broken_saas --run both

Limitations

Single synthetic task. Hand-authored file mapping. Same model in both arms. See ARCHITECTURE.md.

About

Research study measuring LLM output reversion under iterative prompting: how models forget and regress across long conversations.
