auto-dev-agentos

Write a spec. Run the loop. Get a verified project.

A minimal, mode-pluggable engine that orchestrates LLM agents to develop projects, conduct algorithmic research, or audit codebases — autonomously.

Core thesis: Reliability in long-running AI agent tasks comes from system discipline — deterministic orchestration, stateless sessions, file-based state, mandatory verification — not from smarter models.

Architecture

Each session is stateless. State lives in .state/ files. Session N+1 reads what Session N wrote. The engine decides what runs — the LLM only executes.

| | Engineer | Researcher | Auditor |
|---|---|---|---|
| Input | `spec.md` | `hypothesis.md` | `standards.md` |
| Each session | One task | One experiment | One finding |
| On failure | Fix and retry | Revert and learn | Dismiss with evidence |
| Exit when | All tasks pass | Target metric hit | All standards covered |
| State file | `tasks.json` | `journal.json` | `findings.json` |
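
To make the stateless loop concrete, here is a minimal sketch of the engineer-mode cycle: load state from a file, run exactly one task, write the result back, repeat. It is not the project's actual code. The `tasks.json` schema (`id`, `status`), the function names, and the `execute` callback standing in for an LLM session are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def next_pending(tasks):
    """Return the first task that has not passed yet, or None when all pass."""
    return next((t for t in tasks if t["status"] != "passed"), None)


def run_session(state_file, execute):
    """Run one stateless session: load state, do one task, persist the result.

    Returns False when nothing is left to do, True otherwise.
    """
    tasks = json.loads(state_file.read_text())  # the file is the only memory
    task = next_pending(tasks)
    if task is None:
        return False  # exit condition: all tasks pass
    task["status"] = "passed" if execute(task) else "failed"
    state_file.write_text(json.dumps(tasks, indent=2))
    return True


def run_engine(state_file, execute, max_sessions=50):
    """Deterministic outer loop; the executor never chooses what runs next."""
    sessions = 0
    while sessions < max_sessions and run_session(state_file, execute):
        sessions += 1
    return sessions


def demo():
    """Drive two tasks through the loop; T2 fails once, then passes on retry."""
    with tempfile.TemporaryDirectory() as tmp:
        state_file = Path(tmp) / "tasks.json"
        state_file.write_text(json.dumps([
            {"id": "T1", "status": "pending"},
            {"id": "T2", "status": "pending"},
        ]))
        attempts = {}

        def flaky(task):
            # Hypothetical executor: T2 fails on its first attempt only.
            attempts[task["id"]] = attempts.get(task["id"], 0) + 1
            return task["id"] != "T2" or attempts["T2"] > 1

        first_run = run_engine(state_file, flaky)
        resumed = run_engine(state_file, flaky)  # nothing left to do
        return first_run, resumed, json.loads(state_file.read_text())
```

Because every session starts from the file rather than from memory, calling `run_engine` a second time on a finished state file runs zero sessions. Resuming after an interruption relies on the same property.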

Quick Start

git clone https://github.com/leoncuhk/auto-dev-agentos
cd auto-dev-agentos

# See what's available
./run.sh --list-modes

# Preview without invoking Claude (zero cost)
./run.sh --dry-run examples/todo-app

# Write a spec, run the engine
mkdir my-project && printf '# My App\nBuild a REST API...\n' > my-project/spec.md
./run.sh my-project

Prerequisites

Shell engine (run.sh):

brew install jq                         # macOS (or: apt-get install jq)
npm install -g @anthropic-ai/claude-code # Claude Code CLI

SDK engine (run.py — adds strategic review, hooks, cost tracking):

pip install claude-agent-sdk            # Python 3.10+

Usage

# Shell engine
./run.sh [--mode <mode>] [--dry-run] <project-dir> [max-sessions]

# SDK engine
python run.py [--mode <mode>] [--dry-run] <project-dir> [options]

| Flag | Default | Description |
|---|---|---|
| `--mode` | `engineer` | `engineer`, `researcher`, or `auditor` |
| `--dry-run` | | Preview what would run, no LLM calls |
| `--max-sessions` | `50` | Session limit |
| `--orient-interval` | `10` | Strategic review interval (SDK engine only) |
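
For readers who want to see the flag table as code, the sketch below mirrors it with `argparse`. It is an illustration only: the real `run.py` may parse its arguments differently, and because modes are pluggable the hard-coded `choices` list is an assumption.

```python
import argparse


def build_parser():
    """Build a parser whose flags and defaults follow the table above."""
    parser = argparse.ArgumentParser(prog="run.py")
    parser.add_argument("project_dir", help="directory containing the input file")
    parser.add_argument(
        "--mode",
        default="engineer",
        choices=["engineer", "researcher", "auditor"],  # assumed fixed list
    )
    parser.add_argument("--dry-run", action="store_true",
                        help="preview what would run, no LLM calls")
    parser.add_argument("--max-sessions", type=int, default=50)
    parser.add_argument("--orient-interval", type=int, default=10)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--mode", "researcher", "examples/quant-lab"])
    print(args.mode, args.max_sessions, args.dry_run)
```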

| Env Variable | Default | Description |
|---|---|---|
| `PAUSE_SEC` | `5` | Seconds between sessions |
| `REVIEW_INTERVAL` | `5` | Tactical review every N sessions |
| `NO_PROGRESS_MAX` | `3` | Stuck detection threshold |
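
One way to read `NO_PROGRESS_MAX` is sketched below: stop once the progress measure has stayed flat for that many consecutive sessions. This is an assumption about the semantics, not the engine's actual logic. The helper names and the idea of tracking one progress count per session are hypothetical.

```python
import os


def env_int(name, default, environ=None):
    """Read an integer setting from the environment, falling back to a default."""
    environ = os.environ if environ is None else environ
    return int(environ.get(name, default))


def should_stop(progress_history, no_progress_max):
    """Return True once the last `no_progress_max` sessions changed nothing.

    `progress_history` holds one count per finished session, for example the
    number of passed tasks after that session (a hypothetical measure).
    """
    if len(progress_history) <= no_progress_max:
        return False  # not enough sessions yet to call the run stuck
    # Compare the last N sessions against the session just before them.
    recent = progress_history[-(no_progress_max + 1):]
    return len(set(recent)) == 1


if __name__ == "__main__":
    limit = env_int("NO_PROGRESS_MAX", 3)
    print(should_stop([1, 2, 2, 2, 2], limit))
```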

Example: Researcher Mode

The quant-lab demo shows a complete research run — optimizing a trading strategy's Sharpe Ratio from 0.84 to 1.89 across 6 experiments:

| Experiment | Approach | Result | Decision |
|---|---|---|---|
| EXP-001 | Optimize MA parameters | 0.84 → 1.37 | Accepted |
| EXP-002 | MACD confirmation | 1.37 → 0.72 | Rejected (double-lag) |
| EXP-003 | RSI position sizing | 1.37 → 1.33 | Rejected (fights trend) |
| EXP-004 | Stop-loss | Error | Reverted (framework limitation) |
| EXP-005 | Momentum + conviction sizing | 1.37 → 1.89 | Accepted (target exceeded) |
| EXP-006 | Adaptive MA windows | 1.89 → 1.15 | Rejected (boundary instability) |

Failed experiments (002, 003, 004) directly informed the winning experiment (005). The loop works because failures accumulate as knowledge, not waste.

cd examples/quant-lab && python run_backtest.py   # verify: Sharpe = 1.89
cat .state/journal.json                            # full experiment log
cat .state/progress.md                             # session-by-session narrative
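
The accept-or-revert pattern in the experiment table can be replayed with a few lines of Python. The sketch assumes the simplest possible rule (keep a change only if it beats the best Sharpe ratio so far, and revert on error). The demo's real acceptance criteria and the `journal.json` schema are not shown here, so the `Experiment` type and `replay` function are hypothetical. The numbers are taken from the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Experiment:
    exp_id: str
    sharpe: Optional[float]  # None when the run errored out


def replay(baseline, experiments):
    """Apply a keep-only-improvements rule; return (best metric, decisions)."""
    best = baseline
    decisions = {}
    for exp in experiments:
        if exp.sharpe is None:
            decisions[exp.exp_id] = "reverted"   # error: roll back, keep best
        elif exp.sharpe > best:
            best = exp.sharpe                    # improvement becomes the new base
            decisions[exp.exp_id] = "accepted"
        else:
            decisions[exp.exp_id] = "rejected"   # worse than best: discard
    return best, decisions


RUN = [
    Experiment("EXP-001", 1.37),
    Experiment("EXP-002", 0.72),
    Experiment("EXP-003", 1.33),
    Experiment("EXP-004", None),
    Experiment("EXP-005", 1.89),
    Experiment("EXP-006", 1.15),
]

if __name__ == "__main__":
    best, decisions = replay(0.84, RUN)
    print(best, decisions)
```

Under this rule only EXP-001 and EXP-005 are accepted, which matches the Decision column, and the best metric ends at 1.89.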

Project Structure

auto-dev-agentos/
├── run.sh              # Shell engine (single-loop)
├── run.py              # SDK engine (dual-loop, hooks, cost tracking)
├── core.py             # Shared pure functions
├── modes/
│   ├── engineer/       # spec.md → tasks → implement → verify
│   ├── researcher/     # hypothesis.md → experiment → evaluate → learn
│   └── auditor/        # standards.md → scan → analyze → report
├── tests/              # Unit tests (no SDK dependency)
├── docs/               # Design rationale and methodology
└── examples/           # Demo projects (todo-app, quant-lab, audit-demo)

Creating a New Mode

Create modes/<name>/ with mode.conf, CLAUDE.md, and prompts/. The engine picks up new modes automatically. See CONTRIBUTING.md.
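
Automatic pickup of new modes could work along these lines: scan `modes/` and treat any subdirectory containing all three required entries as a mode. This is a sketch under that assumption, not the engine's actual discovery code. The function name and the `reviewer` mode in the demo are hypothetical.

```python
import tempfile
from pathlib import Path

REQUIRED = ("mode.conf", "CLAUDE.md", "prompts")


def discover_modes(modes_dir):
    """Return sorted names of subdirectories that contain every required entry."""
    found = []
    for candidate in Path(modes_dir).iterdir():
        if candidate.is_dir() and all((candidate / name).exists() for name in REQUIRED):
            found.append(candidate.name)
    return sorted(found)


def demo():
    """Create one complete mode and one incomplete mode, then scan for modes."""
    with tempfile.TemporaryDirectory() as tmp:
        modes = Path(tmp) / "modes"
        complete = modes / "reviewer"            # hypothetical new mode
        (complete / "prompts").mkdir(parents=True)
        (complete / "mode.conf").write_text("")
        (complete / "CLAUDE.md").write_text("")
        (modes / "broken").mkdir()               # missing CLAUDE.md and prompts/
        (modes / "broken" / "mode.conf").write_text("")
        return discover_modes(modes)
```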

Design Principles

These address the six failure modes of autonomous LLM agents:

| Principle | Failure mode it solves |
|---|---|
| Stateless sessions | Context degradation: each session starts fresh |
| File-based state | Context window limits: state survives indefinitely |
| One task per session | Implementation drift: no room to simplify under pressure |
| Mandatory verification | Overexcitement: metrics decide, not LLM self-assessment |
| Circuit breaker | Infinite loops: stuck detection + max sessions |
| Deterministic orchestration | All six: shell script decides flow, not LLM |
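
Mandatory verification means the orchestrator, not the model, decides whether a task passed. A minimal sketch of such a gate is below, assuming that verification is an external command whose exit status is the verdict. The `verify` function and its timeout are hypothetical, and the project's actual verification step may differ.

```python
import subprocess
import sys


def verify(command, timeout=60):
    """Run a verification command; only a zero exit status counts as a pass."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired):
        return False  # a missing or hung verifier is treated as a failure
    return result.returncode == 0


if __name__ == "__main__":
    # Stand-in for a real test command such as a project's test suite.
    print(verify([sys.executable, "-c", "raise SystemExit(0)"]))
```

Treating errors and timeouts as failures keeps the gate conservative: a session cannot be marked as passed unless the check actually ran and succeeded.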

FAQ

How much does it cost? Each session is one Claude Code invocation, so total cost scales with the number of sessions. --dry-run previews at zero cost. The session limit (max-sessions, default 50) caps total runs.

Why --dangerously-skip-permissions? The engine runs headless, so no human is present to click "approve." Safety comes from the architecture instead: deterministic orchestration, one-task blast radius, git-versioned state, and circuit breakers.

Can I resume after Ctrl+C? Yes. Same command again. The engine re-reads .state/ and continues.

Can I use a different LLM? Replace claude -p in run.sh with your CLI tool. The architecture is LLM-agnostic; the current implementation uses Claude.

References

Design:

Design Rationale — Why this architecture, what alternatives were considered
Peirce's Inquiry Cycle — Why three roles per mode is logically irreducible
Stateless Agent Architecture — Full argument for stateless sessions
Dual-Loop Architecture — Strategic orientation via OODA outer loop

Research:

Why LLMs Aren't Scientists Yet — Six failure modes in autonomous LLM research (arXiv, 2026)
Building Effective AI Coding Agents — Scaffolding + harness architecture (arXiv, 2026)
Anthropic Agentic Coding Trends — Industry landscape (2026)

Related tools:

Claude Code — Terminal-native AI agent by Anthropic
Claude Agent SDK — Python SDK for agent loops
GitHub Spec Kit — Spec-driven development toolkit
OpenHands — Full-platform autonomous coding agent
SWE-agent — GitHub issue resolution agent
Aider — Interactive AI pair programming
Sakana AI Scientist v2 — Autonomous research via tree search
Gas Town — Multi-agent parallel orchestration
Goose — MCP-native extensible agent framework
BMAD — Multi-agent development method (26 agents)

Contributing

See CONTRIBUTING.md. Run python tests/test_run.py before submitting.

License

AGPL-3.0. See LICENSE.
