
auto-dev-agentos

CI | License: AGPL-3.0 | Python 3.10+

Write a spec. Run the loop. Get a verified project.

A minimal, mode-pluggable engine that orchestrates LLM agents to develop projects, conduct algorithmic research, or audit codebases — autonomously.

Core thesis: Reliability in long-running AI agent tasks comes from system discipline — deterministic orchestration, stateless sessions, file-based state, mandatory verification — not from smarter models.

Architecture

(architecture diagram)

Each session is stateless. State lives in .state/ files. Session N+1 reads what Session N wrote. The engine decides what runs — the LLM only executes.
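The handoff between sessions is plain JSON on disk. A minimal sketch, assuming a hypothetical `tasks.json` layout (the real schema is defined by each mode):

```python
import json
from pathlib import Path

STATE = Path(".state/tasks.json")  # hypothetical layout, for illustration only

def load_state() -> dict:
    """Session N+1 begins by reading what Session N wrote."""
    if STATE.exists():
        return json.loads(STATE.read_text())
    return {"tasks": [], "completed": []}

def save_state(state: dict) -> None:
    """Every session ends by persisting its outcome to disk."""
    STATE.parent.mkdir(exist_ok=True)
    STATE.write_text(json.dumps(state, indent=2))

# One session: read state, take exactly one task, record the result.
state = load_state()
if not state["tasks"]:
    state["tasks"] = ["T1: scaffold project", "T2: add tests"]  # demo seed
pending = [t for t in state["tasks"] if t not in state["completed"]]
if pending:
    task = pending[0]  # the engine picks the task; the LLM only executes it
    # ... run one stateless LLM session against `task` here ...
    state["completed"].append(task)
save_state(state)
```

Because nothing lives in memory between sessions, killing the loop at any point loses at most one task's worth of work.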

| | Engineer | Researcher | Auditor |
| --- | --- | --- | --- |
| Input | spec.md | hypothesis.md | standards.md |
| Each session | One task | One experiment | One finding |
| On failure | Fix and retry | Revert and learn | Dismiss with evidence |
| Exit when | All tasks pass | Target metric hit | All standards covered |
| State file | tasks.json | journal.json | findings.json |

Quick Start

```sh
git clone https://github.com/leoncuhk/auto-dev-agentos
cd auto-dev-agentos

# See what's available
./run.sh --list-modes

# Preview without invoking Claude (zero cost)
./run.sh --dry-run examples/todo-app

# Write a spec, run the engine (printf, since plain echo doesn't expand \n)
mkdir my-project && printf '# My App\nBuild a REST API...\n' > my-project/spec.md
./run.sh my-project
```

Prerequisites

Shell engine (run.sh):

```sh
brew install jq                           # macOS (or: apt-get install jq)
npm install -g @anthropic-ai/claude-code  # Claude Code CLI
```

SDK engine (run.py — adds strategic review, hooks, cost tracking):

```sh
pip install claude-agent-sdk   # Python 3.10+
```

Usage

```sh
# Shell engine
./run.sh [--mode <mode>] [--dry-run] <project-dir> [max-sessions]

# SDK engine
python run.py [--mode <mode>] [--dry-run] <project-dir> [options]
```

| Flag | Default | Description |
| --- | --- | --- |
| `--mode` | engineer | engineer, researcher, or auditor |
| `--dry-run` | off | Preview what would run, no LLM calls |
| `--max-sessions` | 50 | Session limit |
| `--orient-interval` | 10 | Strategic review interval (SDK engine only) |

| Env Variable | Default | Description |
| --- | --- | --- |
| `PAUSE_SEC` | 5 | Seconds between sessions |
| `REVIEW_INTERVAL` | 5 | Tactical review every N sessions |
| `NO_PROGRESS_MAX` | 3 | Stuck detection threshold |
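Resolved in Python, the environment knobs behave like this (a sketch of the defaults only, not the engine's actual parsing; `run.sh` does the equivalent in shell):

```python
import os

def env_int(name: str, default: int) -> int:
    """Read a tuning knob from the environment, falling back to the default."""
    return int(os.environ.get(name, default))

PAUSE_SEC       = env_int("PAUSE_SEC", 5)        # seconds between sessions
REVIEW_INTERVAL = env_int("REVIEW_INTERVAL", 5)  # tactical review cadence
NO_PROGRESS_MAX = env_int("NO_PROGRESS_MAX", 3)  # stuck-detection threshold
```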

Example: Researcher Mode

The quant-lab demo shows a complete research run — optimizing a trading strategy's Sharpe Ratio from 0.84 to 1.89 across 6 experiments:

| Experiment | Approach | Result | Decision |
| --- | --- | --- | --- |
| EXP-001 | Optimize MA parameters | 0.84 → 1.37 | Accepted |
| EXP-002 | MACD confirmation | 1.37 → 0.72 | Rejected (double-lag) |
| EXP-003 | RSI position sizing | 1.37 → 1.33 | Rejected (fights trend) |
| EXP-004 | Stop-loss | Error | Reverted (framework limitation) |
| EXP-005 | Momentum + conviction sizing | 1.37 → 1.89 | Accepted — target exceeded |
| EXP-006 | Adaptive MA windows | 1.89 → 1.15 | Rejected (boundary instability) |

Failed experiments (002, 003, 004) directly informed the winning experiment (005). The loop works because failures accumulate as knowledge, not waste.

```sh
cd examples/quant-lab && python run_backtest.py   # verify: Sharpe = 1.89
cat .state/journal.json                           # full experiment log
cat .state/progress.md                            # session-by-session narrative
```

Project Structure

```text
auto-dev-agentos/
├── run.sh              # Shell engine (single-loop)
├── run.py              # SDK engine (dual-loop, hooks, cost tracking)
├── core.py             # Shared pure functions
├── modes/
│   ├── engineer/       # spec.md → tasks → implement → verify
│   ├── researcher/     # hypothesis.md → experiment → evaluate → learn
│   └── auditor/        # standards.md → scan → analyze → report
├── tests/              # Unit tests (no SDK dependency)
├── docs/               # Design rationale and methodology
└── examples/           # Demo projects (todo-app, quant-lab, audit-demo)
```

Creating a New Mode

Create modes/<name>/ with mode.conf, CLAUDE.md, and prompts/. The engine picks up new modes automatically. See CONTRIBUTING.md.

Design Principles

These address the six failure modes of autonomous LLM agents:

| Principle | Failure mode it solves |
| --- | --- |
| Stateless sessions | Context degradation — each session starts fresh |
| File-based state | Context window limits — state survives indefinitely |
| One task per session | Implementation drift — no room to simplify under pressure |
| Mandatory verification | Overexcitement — metrics decide, not LLM self-assessment |
| Circuit breaker | Infinite loops — stuck detection + max sessions |
| Deterministic orchestration | All six — shell script decides flow, not LLM |
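The circuit breaker amounts to a small guard the orchestrator checks between sessions. A sketch with hypothetical names (the engine's actual checks may differ), assuming the no-progress counter increments whenever a session changes no state:

```python
def should_stop(session: int, max_sessions: int,
                no_progress: int, no_progress_max: int) -> bool:
    """Deterministic exit: the loop ends on a session cap or stuck detection,
    regardless of what the LLM reports about its own progress."""
    if session >= max_sessions:
        return True   # hard cap (--max-sessions)
    if no_progress >= no_progress_max:
        return True   # stuck: N consecutive sessions with no state change
    return False

# Defaults from the tables above: 50 sessions max, stuck after 3 idle sessions.
assert should_stop(50, 50, 0, 3)
assert should_stop(10, 50, 3, 3)
assert not should_stop(10, 50, 2, 3)
```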

FAQ

How much does it cost? Each session is one Claude Code invocation. --dry-run previews at zero cost. MAX_SESSIONS caps total runs.

Why --dangerously-skip-permissions? Headless mode — no human to click "approve." Safety comes from architecture: deterministic orchestration, one-task blast radius, git-versioned state, circuit breakers.

Can I resume after Ctrl+C? Yes. Same command again. The engine re-reads .state/ and continues.

Can I use a different LLM? Replace claude -p in run.sh with your CLI tool. The architecture is LLM-agnostic; the current implementation uses Claude.


Contributing

See CONTRIBUTING.md. Run python tests/test_run.py before submitting.

License

AGPL-3.0. See LICENSE.

