Sow ideas. Run experiments. Harvest evidence.
Quick Start · How It Works · Agents · TUI Dashboard · CLI Reference · Configuration · Examples
- One `run` Command: `paperfarm run` bootstraps a new workflow when `.research/` is missing, or resumes an existing workflow when it already exists.
- Multi-Agent Support: Works with Claude Code, Codex CLI, Aider, OpenCode, Kimi CLI, and Gemini CLI. It auto-detects the first installed agent, or you can pick your own.
- Scout → Prepare → Review → Experiment Flow: An AI agent analyzes your codebase, resolves install/data/smoke bootstrap steps, then runs the `research-v1` loop, keeping what works and discarding what doesn't.
- Research Command Center TUI: A 3-tab `Execution / Metrics / Logs` dashboard with frontier table, parallel worker status, trend chart with results table, and color-coded event stream.
- Safety First: Every experiment is an isolated git commit. Failed experiments auto-rollback. Timeout watchdog, crash counter, and max-experiments limit keep things under control.
- Research-v1 Runtime: A single `Scout -> Manager -> Critic -> Experiment` loop keeps research state explicit and reviewable.
- Headless Mode: Run without the TUI. Outputs structured JSON Lines to stdout, perfect for scripts, CI, or monitoring with external tools.
- Parallel Workers: Run experiments across multiple GPUs in isolated git worktrees, so workers can't interfere with each other.

pip install PaperFarm
cd your-project
paperfarm run

This launches a 4-phase flow:
Plant the first seed with paperfarm run, then let the field work:
- Scout (survey the field): analyze your codebase, search related work, and design evaluation metrics
- Prepare (prepare the soil): resolve a local Python env, install command, data/setup step, and a readiness smoke check
- Review (inspect the crop plan): review the analysis and prepare results in an interactive TUI, then confirm or edit the plan
- Experiment (plant, test, and harvest): `Manager -> Critic -> Experiment` runs the research loop autonomously, keeping what improves metrics
If you want to inspect exactly what run will use before it touches the repo, use:
paperfarm run --dry-run
paperfarm doctor

Run without the TUI, which suits scripts, CI, or monitoring with external tools:

paperfarm run --mode headless --goal "reduce val_loss below 0.3" --max-experiments 20

Outputs structured JSON Lines to stdout, one event per line:
{"ts": "2026-03-10T12:34:56Z", "level": "info", "phase": "scouting", "event": "scout_started"}
{"ts": "2026-03-10T12:40:00Z", "level": "info", "phase": "preparing", "event": "prepare_step_completed", "step": "smoke", "status": "completed"}
{"ts": "2026-03-10T12:45:00Z", "level": "info", "phase": "experimenting", "event": "experiment_completed", "idea": "idea-001", "metric_value": 0.95, "experiment_num": 3, "max_experiments": 20}
{"ts": "2026-03-10T12:50:00Z", "level": "info", "phase": "done", "event": "limit_reached", "detail": "Max experiments (20) reached"}

Also writes to .research/events.jsonl for persistent logging. Interactive mode writes the same canonical event stream, so the TUI and headless mode share one runtime log.
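Because every event is one JSON object per line, the stream is straightforward to consume from a script. The following is a minimal sketch, not part of PaperFarm: it relies only on the field names visible in the sample above (`event`, `idea`, `metric_value`), and the helper names `parse_events` and `best_experiment` are hypothetical.

```python
import json
from typing import Iterable, Optional


def parse_events(lines: Iterable[str]) -> list[dict]:
    """Parse JSON Lines text, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def best_experiment(events: list[dict], higher_is_better: bool = True) -> Optional[dict]:
    """Return the experiment_completed event with the best metric_value, if any."""
    done = [
        e for e in events
        if e.get("event") == "experiment_completed" and "metric_value" in e
    ]
    if not done:
        return None
    pick = max if higher_is_better else min
    return pick(done, key=lambda e: e["metric_value"])


# Sample lines copied from the headless output shown above.
SAMPLE = [
    '{"ts": "2026-03-10T12:34:56Z", "level": "info", "phase": "scouting", "event": "scout_started"}',
    '{"ts": "2026-03-10T12:45:00Z", "level": "info", "phase": "experimenting", '
    '"event": "experiment_completed", "idea": "idea-001", "metric_value": 0.95, '
    '"experiment_num": 3, "max_experiments": 20}',
    '{"ts": "2026-03-10T12:50:00Z", "level": "info", "phase": "done", '
    '"event": "limit_reached", "detail": "Max experiments (20) reached"}',
]

if __name__ == "__main__":
    best = best_experiment(parse_events(SAMPLE))
    print(best["idea"], best["metric_value"])
```

In practice the same functions could be fed from the stdout of a headless run or from the lines of `.research/events.jsonl`; whether higher or lower is better depends on the metric direction configured for the project.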
pip install PaperFarm
cd your-project
paperfarm init # Initialize .research/ directory
paperfarm run --agent claude-code # Launch with TUI dashboard
# Go to sleep. Check results in the morning:
paperfarm status --sparkline
paperfarm results --chart primary

Try the interactive demo (no agent or API key needed):

paperfarm demo                      # run in terminal
paperfarm demo --serve              # open in browser at http://localhost:8000
paperfarm demo --serve --port 9000
Open Researcher generates a .research/ directory in your repo with everything needed for autonomous research.
.research/ Directory Structure

| File | Purpose |
|---|---|
| `scout_program.md` | Scout agent instructions for the project analysis phase |
| `.internal/role_programs/*.md` | Internal runtime role prompts (manager / critic / experiment), auto-managed |
| `config.yaml` | Mode, metrics, timeout, experiment limits, agent settings, and `bootstrap.*` overrides |
| `project-understanding.md` | Agent fills: what the project does |
| `research-strategy.md` | Agent fills: research direction and focus areas |
| `literature.md` | Agent fills: related work and prior art |
| `evaluation.md` | Agent fills: how to measure improvement |
| `bootstrap_state.json` | Canonical install/data/smoke state for repo readiness |
| `prepare.log` | Raw logs from env install, data prep, and smoke execution |
| `idea_pool.json` | Projected experiment backlog with priority, status, and worker claim metadata |
| `results.tsv` | Experiment log (timestamp, commit, metrics, status) |
| `events.jsonl` | Canonical runtime event stream for research + control |
| `research_graph.json` | Canonical hypothesis / experiment / evidence graph |
| `research_memory.json` | Repo prior, ideation, and experiment memory |
| `control.json` | Compatibility snapshot of pause/resume/skip state |
| `activity.json` | Real-time agent status for TUI display |
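Since `results.tsv` is a plain tab-separated log, it can be summarized with the standard library. The sketch below is illustrative only: the column names (`timestamp`, `commit`, `primary_metric`, `status`), the `keep` / `discard` status values, and the sample rows are assumptions made up for this example, so check them against a real `results.tsv` before reusing the code.

```python
import csv
import io
from statistics import mean

# Made-up rows; the header names are an assumption, not the documented schema.
SAMPLE_TSV = (
    "timestamp\tcommit\tprimary_metric\tstatus\n"
    "2026-03-10T12:45:00Z\tabc1234\t0.91\tkeep\n"
    "2026-03-10T13:05:00Z\tdef5678\t0.88\tdiscard\n"
    "2026-03-10T13:30:00Z\t0a1b2c3\t0.95\tkeep\n"
)


def summarize(tsv_text: str, higher_is_better: bool = True) -> dict:
    """Compute kept/discarded counts and best/mean/latest of the primary metric."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    values = [float(r["primary_metric"]) for r in rows]
    pick = max if higher_is_better else min
    return {
        "kept": sum(r["status"] == "keep" for r in rows),
        "discarded": sum(r["status"] == "discard" for r in rows),
        "best": pick(values) if values else None,
        "mean": mean(values) if values else None,
        "latest": values[-1] if values else None,
    }


if __name__ == "__main__":
    print(summarize(SAMPLE_TSV))
```

For a real run, replace `SAMPLE_TSV` with the contents of `.research/results.tsv` and adjust the column names to match its header.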
The Scout → Prepare → Review → Experiment Flow

Phase 0: Bootstrap
  └─ Auto-init .research/ if needed, load config
Phase 1: Goal Input
  └─ Optional research goal (TUI modal or --goal flag)
Phase 2: Scout Analysis
  ├─ Read codebase → project-understanding.md
  ├─ Search related work → literature.md
  ├─ Define strategy → research-strategy.md
  └─ Design evaluation + bootstrap hints → evaluation.md + config.yaml
Phase 3: Repository Prepare
  ├─ Resolve local Python env
  ├─ Resolve install_command / data_command / smoke_command
  ├─ Run install/data/smoke with logs in .research/prepare.log
  └─ Persist readiness state in .research/bootstrap_state.json
Phase 4: Human Review (TUI only, auto-confirmed in headless)
  ├─ Review all Scout outputs
  ├─ Review bootstrap resolution and readiness
  └─ Confirm, edit, or re-analyze
Phase 5: Research-v1 Loop
  ├─ Manager proposes/refines hypotheses and frontier rows
  ├─ Critic reviews experiment specs before execution
  ├─ Experiment agent implements, tests, and evaluates → results.tsv
  ├─ Critic records evidence and claim updates into research_graph.json
  └─ Repeat until no runnable frontier remains or --max-experiments reached
Each experiment is a git commit. Successful experiments stay; failed ones are rolled back. Everything is logged in results.tsv.
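The keep-or-roll-back rule can be pictured as a comparison against the best metric so far. This is a simplified sketch under stated assumptions: the direction strings `higher_is_better` and `lower_is_better` are the ones listed for `metrics.primary.direction` in the configuration reference, while the function names and the idea that a single comparison decides the outcome are hypothetical. The actual runtime also involves the Critic's review.

```python
from typing import Optional

DIRECTIONS = ("higher_is_better", "lower_is_better")


def is_improvement(new: float, best: Optional[float], direction: str) -> bool:
    """True if `new` beats `best` for the given metric direction."""
    if direction not in DIRECTIONS:
        raise ValueError(f"unknown direction: {direction!r}")
    if best is None:  # first successful experiment sets the baseline
        return True
    return new > best if direction == "higher_is_better" else new < best


def decide(new: float, best: Optional[float], direction: str) -> str:
    """Map the comparison to the action described above."""
    return "keep" if is_improvement(new, best, direction) else "rollback"


if __name__ == "__main__":
    print(decide(0.28, 0.31, "lower_is_better"))
```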
Auto-Prepare Resolution Rules

paperfarm run now tries to make a local Python repo runnable before the research loop starts.
- Python env priority: explicit `bootstrap.python` → active virtualenv → repo `.venv` → auto-create `.venv`
- Install priority: explicit `bootstrap.install_command` → `uv sync` → `poetry install` → `python -m pip install -r requirements.txt` → `python -m pip install -e .`
- Data/setup priority: explicit `bootstrap.data_command` → `make setup|prepare|data|download-data` → `scripts/prepare*.py` / `scripts/download*.py` / `data/*/prepare.py`
- Smoke priority: explicit `bootstrap.smoke_command` → first runnable command block from `.research/evaluation.md` → `pytest -q` → `make test`
If a command cannot be resolved safely, run stops before the review/runtime stage and records the failure in .research/bootstrap_state.json.
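The install priority above amounts to a first-match lookup. The sketch below illustrates that idea only: the command order comes from the list above, but the marker files used to decide whether each command applies (`uv.lock`, `poetry.lock`, and so on) are assumptions, as is the function name `resolve_install_command`.

```python
from typing import Iterable, Optional

# (marker file, command) pairs in priority order. The commands follow the
# documented order; the marker files are an assumption for illustration.
INSTALL_CANDIDATES = [
    ("uv.lock", "uv sync"),
    ("poetry.lock", "poetry install"),
    ("requirements.txt", "python -m pip install -r requirements.txt"),
    ("pyproject.toml", "python -m pip install -e ."),
    ("setup.py", "python -m pip install -e ."),
]


def resolve_install_command(repo_files: Iterable[str], explicit: str = "") -> Optional[str]:
    """Return the first applicable install command, or None if nothing matches."""
    if explicit.strip():  # an explicit bootstrap.install_command always wins
        return explicit.strip()
    present = set(repo_files)
    for marker, command in INSTALL_CANDIDATES:
        if marker in present:
            return command
    return None  # unresolved: the caller should stop rather than guess


if __name__ == "__main__":
    print(resolve_install_command({"requirements.txt", "setup.py"}))
```

A caller could pass the file names found at the repository root, for example `{p.name for p in pathlib.Path(".").iterdir()}`. Returning `None` mirrors the documented behavior of stopping when a command cannot be resolved safely.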
| Feature | Description |
|---|---|
| Isolated git commits | Every experiment is a separate commit, so nothing is lost |
| Auto-rollback | Failed experiments are automatically rolled back via git reset |
| Timeout watchdog | Kills experiments exceeding the configured time limit |
| Crash counter | Auto-pauses after N consecutive crashes (default: 3) |
| Max experiments | Stops after N experiments (--max-experiments or config.yaml) |
| Control plane | Pause / resume / skip commands are event-backed in events.jsonl, with control.json kept as a compatibility snapshot |
| Failure memory | Persistent ledger of past failures, ranked by recovery success |
| Phase gate | In collaborative mode, pauses between phase transitions |
| Parallel workers | Run experiments across multiple GPUs in isolated worktrees |
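As an illustration of the crash counter row, the sketch below pauses after N consecutive crashes and resets on any success. It is a hypothetical stand-in, not the project's `crash_counter.py`; only the default of 3 is taken from the table.

```python
class CrashCounter:
    """Track consecutive crashes and signal when the loop should pause."""

    def __init__(self, max_consecutive: int = 3) -> None:
        self.max_consecutive = max_consecutive
        self.consecutive = 0

    def record(self, crashed: bool) -> bool:
        """Record one experiment outcome; return True if the loop should pause."""
        # A successful experiment resets the streak; a crash extends it.
        self.consecutive = self.consecutive + 1 if crashed else 0
        return self.consecutive >= self.max_consecutive


if __name__ == "__main__":
    counter = CrashCounter()
    outcomes = [True, True, False, True, True, True]
    print([counter.record(crashed) for crashed in outcomes])
```

Counting only consecutive crashes means an occasional failure does not stop a long run, while a systematic breakage (for example a broken environment) pauses it quickly.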
| Agent | Command | Status |
|---|---|---|
| Claude Code | `--agent claude-code` | Supported |
| Codex CLI | `--agent codex` | Supported |
| Aider | `--agent aider` | Supported |
| OpenCode | `--agent opencode` | Supported |
| Kimi CLI | `--agent kimi-cli` | Supported |
| Gemini CLI | `--agent gemini-cli` | Supported |
Auto-detection: If you don't specify --agent, Open Researcher finds the first installed one.
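Auto-detection of this kind is typically a scan of `PATH` in a fixed order. The sketch below shows one way it could work using `shutil.which`; the agent names are the `--agent` values from the table above, but the executable names and the scan order are assumptions, and `detect_agent` is a hypothetical helper.

```python
import shutil
from typing import Callable, Optional

# --agent value -> executable to look for on PATH.
# The executable names are assumptions for illustration.
AGENT_BINARIES = {
    "claude-code": "claude",
    "codex": "codex",
    "aider": "aider",
    "opencode": "opencode",
    "kimi-cli": "kimi",
    "gemini-cli": "gemini",
}


def detect_agent(which: Callable[[str], Optional[str]] = shutil.which) -> Optional[str]:
    """Return the first agent whose executable is found, or None."""
    for agent, binary in AGENT_BINARIES.items():  # dicts keep insertion order
        if which(binary):
            return agent
    return None


if __name__ == "__main__":
    print(detect_agent() or "no supported agent found")
```

Passing `which` as a parameter keeps the lookup testable without installing any agent.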
Agent Configuration
Customize agent parameters in .research/config.yaml:
agents:
  claude-code:
    model: "claude-sonnet-4-5-20250514"   # override model
    allowed_tools: "Edit,Write,Bash,Read,Glob,Grep"
    extra_flags: ["--max-turns", "50"]
  codex:
    model: "gpt-5.2"                      # override default
    sandbox: "workspace-write"            # workspace-write | read-only | danger-full-access | full-auto
  aider:
    model: "gpt-4o"
    extra_flags: ["--no-git"]
  opencode:
    model: "openai/gpt-5"
    agent: "builder"
    extra_flags: ["--share"]
  kimi-cli:
    model: ""                             # optional model override
    agent: "okabe"                        # optional built-in agent profile
    agent_file: ""                        # custom agent file path (optional)
    extra_flags: ["--thinking"]
  gemini-cli:
    model: "gemini-3.1-pro"               # override default model
    sandbox: ""                           # optional sandbox mode
    extra_flags: []

The interactive TUI is a research command center built around the runtime state in .research/: frontier items, experiment results, worker status, and the event stream. It has three tabs: Execution (frontier + workers), Metrics (summary stats + trend chart + results table), and Logs (color-coded event stream). It supports human-in-the-loop checkpoints: review hypotheses, override results, inject ideas, and edit goals without leaving the terminal.
Execution: frontier table sorted by priority with colored status, parallel workers running on multiple GPUs.
Metrics: summary stats (kept/discarded/best/mean/latest), braille trend chart, and scrollable results table.
Logs: color-coded event stream with aligned prefixes (SKILL / DONE / W+ / W- / RES / WAIT / REVW / INJ / GOAL) across rounds.
Hypothesis Review: human-in-the-loop checkpoint to toggle, approve all, or reject frontier items before the next round.
Paused: one-key pause/resume with bold indicator on the status bar.
Completed: all phases checked off, final frontier state with best metric displayed.
More Screenshots
Result Review: override AI keep/discard decisions and add constraints for the next round.
Inject Experiment: add a human-authored idea to the frontier with priority.
Edit Goal: update research constraints and direction mid-run.
Stress Test: round 10, 8 frontier items, 6 parallel workers across GPUs; scales smoothly.
Idle: clean initial state before any research round starts.
Failed: bold red indicator when the research loop encounters an unrecoverable error.
3 Tabs & Keyboard Shortcuts

3 tabs:
- Execution: Frontier table (sorted by priority, colored status) + Workers panel (GPU, frontier assignment, live status)
- Metrics: Summary stats bar (kept/discarded/best/mean/latest + trend arrow) + braille trend chart + scrollable results table
- Logs: Color-coded event stream (SKILL / DONE / OUT / W+ / W- / RES / WAIT / REVW / INJ / GOAL)
Keyboard shortcuts: p pause, r resume, s skip, g edit goal, i inject idea, q quit.
Human-in-the-Loop Checkpoints

- Hypothesis Review: After the manager proposes ideas, review frontier items: toggle keep/reject, approve all, or skip.
- Result Review: After experiments complete, review AI decisions (keep/discard) and override any result.
- Inject Idea (`i` key): Add a human-authored experiment to the frontier at any time.
- Edit Goal (`g` key): Update research constraints and direction mid-run.
- Pause/Resume (`p`/`r` keys): Temporarily halt the research loop.
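The safety table describes pause / resume / skip as event-backed in `events.jsonl`, with `control.json` kept as a snapshot. One way to picture that relationship is a replay that folds control events into the current state. This is a sketch only: the event names (`pause_requested`, `resume_requested`, `skip_requested`) and the snapshot keys are hypothetical, chosen for illustration rather than taken from the real event contract.

```python
from typing import Iterable


def replay_control(events: Iterable[dict]) -> dict:
    """Fold control events, in order, into a pause/skip snapshot."""
    state = {"paused": False, "skip_current": False}
    for event in events:
        kind = event.get("event")
        if kind == "pause_requested":
            state["paused"] = True
        elif kind == "resume_requested":
            state["paused"] = False
        elif kind == "skip_requested":
            state["skip_current"] = True
        elif kind == "experiment_completed":
            # A finished experiment consumes any pending skip request.
            state["skip_current"] = False
    return state


if __name__ == "__main__":
    log = [
        {"event": "pause_requested"},
        {"event": "skip_requested"},
        {"event": "resume_requested"},
    ]
    print(replay_control(log))
```

Deriving the snapshot from the log means the event stream stays the single source of truth, and the snapshot can always be rebuilt from it.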
Open Researcher supports Linux, macOS, and Windows. Python 3.10+ required.
pip install PaperFarm
# Try the demo first (no agent or API key needed)
paperfarm demo # run in terminal
paperfarm demo --serve # open in browser at http://localhost:8000
# Install browser support (optional)
pip install "PaperFarm[serve]"
# Then use it for real
cd your-project
paperfarm run

Linux / macOS / Windows
git clone https://github.com/shatianming5/PaperFarm.git
cd PaperFarm
make dev # install with dev dependencies
make test # run tests
make test-cov # run tests with coverage gate (>=75%)
make lint # run linter
make package-check # build wheel + install + CLI smoke test
make ci # full local CI: lint + test + coverage + package smoke

All commands:
paperfarm <command>
Core Commands

| Command | What It Does |
|---|---|
| `run` | Primary command: bootstrap if needed, otherwise run the existing workflow |
| `run --mode headless --goal "..." --max-experiments N` | Headless JSON Lines mode |
| `run --workers N` | Set experiment worker count for serial or parallel execution |
| `init [--tag NAME]` | Initialize `.research/` directory |
| `demo` | Try the TUI with sample data (no agent needed) |
| `demo --serve [--port N]` | Serve the demo TUI in a browser (requires `PaperFarm[serve]`) |

Hidden compatibility alias: start still works for older scripts, but it is deprecated. Use run.
Monitoring & Results

| Command | What It Does |
|---|---|
| `status [--sparkline]` | Show experiment progress |
| `results [--chart primary] [--json]` | Print results table or chart |
| `logs [--follow] [--errors]` | View agent logs |
| `export` | Export markdown report |
Idea Management

| Command | What It Does |
|---|---|
| `ideas list` | Inspect the projected backlog currently derived from `research_graph.json` |
| `ideas add "description"` | Compatibility command that now refuses mutation under research-v1 |
| `ideas delete IDEA_ID` | Compatibility command that now refuses mutation under research-v1 |
| `ideas prioritize` | Compatibility command that now refuses mutation under research-v1 |
Utilities & Diagnostics

| Command | What It Does |
|---|---|
| `config show` | View/validate configuration |
| `doctor` | Health check environment |

Edit .research/config.yaml:
Full Configuration Reference

mode: autonomous                  # autonomous | collaborative
experiment:
  timeout: 600                    # seconds per experiment before kill
  max_consecutive_crashes: 3      # pause after N consecutive crashes
  max_experiments: 0              # 0 = unlimited; set to N to stop after N experiments
  max_parallel_workers: 0         # 0 = auto (one per GPU), 1 = serial
  worker_agent: ""                # agent for sub-workers (default: same as master)
metrics:
  primary:
    name: ""                      # filled by agent (e.g., "val_loss")
    direction: ""                 # higher_is_better | lower_is_better
environment: |
  # Free-form notes for agents. Runtime execution uses bootstrap.* below.
bootstrap:
  auto_prepare: true              # run install/data/smoke before review/runtime
  working_dir: "."                # relative to repo root
  python: ""                      # explicit python path if needed
  install_command: ""             # explicit dependency install command
  data_command: ""                # explicit dataset/setup command
  smoke_command: ""               # explicit readiness check command
  expected_paths: []              # files/dirs that data/setup must materialize
  requires_gpu: false             # fail prepare if GPU is required but unavailable
research:
  protocol: research-v1
  manager_batch_size: 3
  critic_repro_policy: best_or_surprising
memory:
  ideation: true
  experiment: true
  repo_type_prior: true
roles:
  scout_agent: ""                 # optional override
  manager_agent: ""               # optional override
  critic_agent: ""                # optional override
  experiment_agent: ""            # optional override
gpu:
  remote_hosts: []                # optional remote GPU allocation hosts
agents:                           # per-agent overrides (optional)
  claude-code:
    model: ""
    allowed_tools: "Edit,Write,Bash,Read,Glob,Grep"

Core System

| Module | Description |
|---|---|
| `cli.py` | CLI entry point, all commands (Typer) |
| `run_cmd.py` | Unified workflow entrypoint: bootstrap flow + existing-workflow runner |
| `headless.py` | Headless mode (JSON Lines output) |
| `init_cmd.py` | Initialize `.research/` directory |
| `config.py` | Configuration parsing |

Agent Adapters (agents/)

| Module | Description |
|---|---|
| `base.py` | AgentAdapter abstract base class |
| `claude_code.py` | Claude Code adapter |
| `codex.py` | Codex CLI adapter |
| `aider.py` | Aider adapter |
| `opencode.py` | OpenCode adapter |
| `kimi.py` | Kimi CLI adapter |
| `gemini.py` | Gemini CLI adapter |

TUI Components (tui/)

| Module | Description |
|---|---|
| `app.py` | Main Textual application for the 3-tab research command center |
| `widgets.py` | Command, execution, logs, docs, lineage, frontier, and detail drawer widgets |
| `view_model.py` | TUI-specific aggregation layer from graph / memory / results / events into renderable state |
| `review.py` | Post-Scout review TUI |
| `modals.py` | Modal dialogs (AddIdea, GPUStatus, Log) |
| `tui_runner.py` | Shared Textual session lifecycle for bootstrap and existing-workflow entrypoints |
| `styles.css` | CSS styling |

Runtime Engine

| Module | Description |
|---|---|
| `idea_pool.py` | Serial idea backlog plus parallel claim handling for workers |
| `research_loop.py` | Shared Scout → Manager → Critic → Experiment core loop |
| `research_events.py` | Typed event contract shared by TUI and headless |
| `event_journal.py` | Shared JSONL journal for runtime and control events |
| `control_plane.py` | Runtime control (pause/resume/skip) |
| `failure_memory.py` | Failure memory ledger (categorize, improve fixes) |
| `worker.py` | Parallel worker management (multi-GPU) |
| `worktree.py` | Git worktree management (worker isolation) |
| `gpu_manager.py` | GPU allocation (local/remote) |
| `watchdog.py` | Timeout watchdog (kill runaway experiments) |
| `crash_counter.py` | Crash counter (auto-pause after N failures) |
| `phase_gate.py` | Phase gate (collaborative mode confirmation) |
| `activity.py` | Activity monitor (real-time agent status) |

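Worker isolation via git worktrees can be sketched as building one `git worktree add` command and one GPU-pinned environment per worker. The `git worktree add -b <branch> <path> <commit-ish>` form and the `CUDA_VISIBLE_DEVICES` variable are standard; the directory layout, branch naming, and function names below are hypothetical and do not describe how `worktree.py` or `gpu_manager.py` are actually written.

```python
import os
from pathlib import Path


def worktree_add_command(repo: Path, worker_id: int, base: str = "HEAD") -> list[str]:
    """Build the git command that creates an isolated checkout for one worker."""
    path = repo / ".worktrees" / f"worker-{worker_id}"  # hypothetical layout
    branch = f"paperfarm/worker-{worker_id}"            # hypothetical branch name
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path), base]


def worker_env(gpu_index: int) -> dict[str, str]:
    """Copy the current environment and pin the worker to a single GPU."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


if __name__ == "__main__":
    for worker_id in range(2):
        print(" ".join(worktree_add_command(Path("."), worker_id)))
```

The command list can be passed to `subprocess.run(cmd, check=True)`, and the environment to the `env=` argument of the process that runs the experiment. Because each worker gets its own branch and directory, concurrent edits never touch the same working tree.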
See examples/ for complete setups:
- nanoGPT: Reduce validation loss in character-level language model training
- Liger-Kernel: Optimize Triton GPU kernels
- HF GLUE: Improve HuggingFace Transformers fine-tuning
- CIFAR-10 Speedrun: Maximize CIFAR-10 image classification accuracy
- YOLO Tiny: Optimize YOLOv8 object detection on COCO8
- Whisper Fine-tune: Reduce Whisper speech recognition word error rate
- CartPole RL: Maximize CartPole-v1 reinforcement learning reward
- Code Perf: Optimize Python JSON parser throughput (non-ML)
Contributions are welcome! Please follow these steps:
- Open an issue to discuss the proposed change
- Fork the repository and create your feature branch
- Submit a pull request with a clear description
See CONTRIBUTING.md for guidelines and CHANGELOG.md for version history.
This project is licensed under the MIT License.











