
🧑‍🌾 PaperFarm: Planting GPUs & APIs 🌱, Harvesting Papers & SOTAs 🌾


🔬 Point it at any repo — sow ideas, run experiments, and harvest better code autonomously

🌱 Sow ideas. 🚜 Run experiments. 🌾 Harvest evidence. 📄

Quick Start · How It Works · Agents · TUI Dashboard · CLI Reference · Configuration · Examples


🌾 Key Features

  • 🚀 One run Command: paperfarm run bootstraps a new workflow when .research/ is missing, or resumes an existing workflow when it already exists.

  • 🤖 Multi-Agent Support: Works with Claude Code, Codex CLI, Aider, OpenCode, Kimi CLI, and Gemini CLI — auto-detects the first installed agent, or pick your own.

  • 🔬 Scout → Prepare → Review → Experiment Flow: The AI agent analyzes your codebase, resolves install/data/smoke bootstrap steps, then runs the research-v1 loop — keeping what works, discarding what doesn't.

  • 🖥️ Research Command Center TUI: A 3-tab Execution / Metrics / Logs dashboard with frontier table, parallel worker status, trend chart with results table, and color-coded event stream.

  • 🛡️ Safety First: Every experiment is an isolated git commit. Failed experiments auto-roll back. A timeout watchdog, crash counter, and max-experiments limit keep things under control.

  • 🧭 Research-v1 Runtime: A single Scout → Manager → Critic → Experiment loop keeps research state explicit and reviewable.

  • 📡 Headless Mode: Run without the TUI — outputs structured JSON Lines to stdout, perfect for scripts, CI, or monitoring with external tools.

  • ⚡ Parallel Workers: Run experiments across multiple GPUs in isolated git worktrees — workers can't interfere with each other.


🌱 Quick Start

One-Command Workflow (Recommended)

pip install PaperFarm

cd your-project
paperfarm run

This launches a 4-phase flow. Plant the first seed with paperfarm run, then let the field work:

  1. Scout — survey the field: analyze your codebase, search related work, and design evaluation metrics
  2. Prepare — prepare the soil: resolve a local Python env, install command, data/setup step, and a readiness smoke check
  3. Review — inspect the crop plan: review the analysis and prepare results in an interactive TUI, then confirm or edit the plan
  4. Experiment — plant, test, and harvest: Manager → Critic → Experiment runs the research loop autonomously, keeping what improves metrics

If you want to inspect exactly what run will use before it touches the repo, use:

paperfarm run --dry-run
paperfarm doctor

Headless Mode

Run without the TUI — perfect for scripts, CI, or monitoring with external tools:

paperfarm run --mode headless --goal "reduce val_loss below 0.3" --max-experiments 20

Outputs structured JSON Lines to stdout, one event per line:

{"ts": "2026-03-10T12:34:56Z", "level": "info", "phase": "scouting", "event": "scout_started"}
{"ts": "2026-03-10T12:40:00Z", "level": "info", "phase": "preparing", "event": "prepare_step_completed", "step": "smoke", "status": "completed"}
{"ts": "2026-03-10T12:45:00Z", "level": "info", "phase": "experimenting", "event": "experiment_completed", "idea": "idea-001", "metric_value": 0.95, "experiment_num": 3, "max_experiments": 20}
{"ts": "2026-03-10T12:50:00Z", "level": "info", "phase": "done", "event": "limit_reached", "detail": "Max experiments (20) reached"}

Also writes to .research/events.jsonl for persistent logging. Interactive mode now writes the same canonical event stream, so TUI and headless share one runtime log.
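For scripting against a headless run, the event stream can be post-processed with a few lines of Python. A minimal sketch (the helper names are illustrative, not part of PaperFarm's API; field names follow the sample events above):

```python
import json
from pathlib import Path

def iter_events(path=".research/events.jsonl"):
    """Yield parsed runtime events, skipping blank or half-written lines."""
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a partially flushed trailing line

def completed_metrics(path=".research/events.jsonl"):
    """Collect metric values from experiment_completed events."""
    return [e["metric_value"] for e in iter_events(path)
            if e.get("event") == "experiment_completed"]
```

The same approach works for live monitoring if you re-read the file periodically, or follow it with tail -f piped into jq.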

Manual Step-by-Step

pip install PaperFarm

cd your-project
paperfarm init                      # Initialize .research/ directory
paperfarm run --agent claude-code   # Launch with TUI dashboard
# Go to sleep. Check results in the morning:
paperfarm status --sparkline
paperfarm results --chart primary

Try the interactive demo β€” no agent or API key needed:

paperfarm demo              # run in terminal
paperfarm demo --serve      # open in browser at http://localhost:8000
paperfarm demo --serve --port 9000

🚜 How It Works

PaperFarm generates a .research/ directory in your repo with everything needed for autonomous research.

📂 .research/ Directory Structure

| File | Purpose |
| --- | --- |
| scout_program.md | Scout agent instructions — project analysis phase |
| .internal/role_programs/*.md | Internal runtime role prompts (manager / critic / experiment), auto-managed |
| config.yaml | Mode, metrics, timeout, experiment limits, agent settings, and bootstrap.* overrides |
| project-understanding.md | Agent fills: what the project does |
| research-strategy.md | Agent fills: research direction and focus areas |
| literature.md | Agent fills: related work and prior art |
| evaluation.md | Agent fills: how to measure improvement |
| bootstrap_state.json | Canonical install/data/smoke state for repo readiness |
| prepare.log | Raw logs from env install, data prep, and smoke execution |
| idea_pool.json | Projected experiment backlog with priority, status, and worker claim metadata |
| results.tsv | Experiment log (timestamp, commit, metrics, status) |
| events.jsonl | Canonical runtime event stream for research + control |
| research_graph.json | Canonical hypothesis / experiment / evidence graph |
| research_memory.json | Repo prior, ideation, and experiment memory |
| control.json | Compatibility snapshot of pause/resume/skip state |
| activity.json | Real-time agent status for TUI display |
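Since results.tsv is a plain tab-separated log, the best experiment can be pulled out with the standard csv module. A rough sketch under assumed column headers (the status and metric names below are hypothetical; check the header row of your generated file):

```python
import csv

def best_kept_result(path=".research/results.tsv", higher_is_better=True):
    """Return the kept experiment row with the best primary metric, or None.

    Assumes a header row with 'status' and 'metric' columns; adjust to
    match the actual results.tsv produced in your repo.
    """
    with open(path, newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh, delimiter="\t"))
    kept = [r for r in rows if r.get("status") == "kept"]
    if not kept:
        return None
    pick = max if higher_is_better else min
    return pick(kept, key=lambda r: float(r["metric"]))
```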
🔄 The Scout → Prepare → Review → Experiment Flow

Phase 0: Bootstrap
  └─ Auto-init .research/ if needed, load config

Phase 1: Goal Input
  └─ Optional research goal (TUI modal or --goal flag)

Phase 2: Scout Analysis
  ├─ Read codebase → project-understanding.md
  ├─ Search related work → literature.md
  ├─ Define strategy → research-strategy.md
  └─ Design evaluation + bootstrap hints → evaluation.md + config.yaml

Phase 3: Repository Prepare
  ├─ Resolve local Python env
  ├─ Resolve install_command / data_command / smoke_command
  ├─ Run install/data/smoke with logs in .research/prepare.log
  └─ Persist readiness state in .research/bootstrap_state.json

Phase 4: Human Review (TUI only, auto-confirmed in headless)
  ├─ Review all Scout outputs
  ├─ Review bootstrap resolution and readiness
  └─ Confirm, edit, or re-analyze

Phase 5: Research-v1 Loop
  ├─ Manager proposes/refines hypotheses and frontier rows
  ├─ Critic reviews experiment specs before execution
  ├─ Experiment agent implements, tests, and evaluates → results.tsv
  ├─ Critic records evidence and claim updates into research_graph.json
  └─ Repeat until no runnable frontier remains or --max-experiments reached

Each experiment is a git commit. Successful experiments stay; failed ones are rolled back. Everything is logged in results.tsv.
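The commit-per-experiment pattern can be sketched in plain git. This is an illustrative reconstruction in a throwaway repo, not PaperFarm's actual implementation:

```shell
set -eu
repo=$(mktemp -d)                                 # throwaway repo for the sketch
git -C "$repo" init -q
git -C "$repo" -c user.email=bot@example.com -c user.name=bot \
    commit -q --allow-empty -m "baseline"
before=$(git -C "$repo" rev-parse HEAD)           # remember the pre-experiment commit
echo "lr=0.01" > "$repo/experiment.cfg"           # agent edits files, runs the experiment...
git -C "$repo" add -A
git -C "$repo" -c user.email=bot@example.com -c user.name=bot \
    commit -qm "experiment: idea-001"             # each attempt becomes its own commit
git -C "$repo" reset -q --hard "$before"          # a failed attempt rolls back cleanly
test "$(git -C "$repo" rev-parse HEAD)" = "$before"
```

Because every attempt lands in its own commit before evaluation, a kept experiment is simply a commit that never gets reset away.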

🧰 Auto-Prepare Resolution Rules

paperfarm run now tries to make a local Python repo runnable before the research loop starts.

  • Python env priority: explicit bootstrap.python → active virtualenv → repo .venv → auto-create .venv
  • Install priority: explicit bootstrap.install_command → uv sync → poetry install → python -m pip install -r requirements.txt → python -m pip install -e .
  • Data/setup priority: explicit bootstrap.data_command → make setup|prepare|data|download-data → scripts/prepare*.py / scripts/download*.py / data/*/prepare.py
  • Smoke priority: explicit bootstrap.smoke_command → first runnable command block from .research/evaluation.md → pytest -q → make test

If a command cannot be resolved safely, run stops before the review/runtime stage and records the failure in .research/bootstrap_state.json.


🛡️ Field Safety & Runtime Controls

| Feature | Description |
| --- | --- |
| Isolated git commits | Every experiment is a separate commit — nothing is lost |
| Auto-rollback | Failed experiments are automatically rolled back via git reset |
| Timeout watchdog | Kills experiments exceeding the configured time limit |
| Crash counter | Auto-pauses after N consecutive crashes (default: 3) |
| Max experiments | Stops after N experiments (--max-experiments or config.yaml) |
| Control plane | Pause / resume / skip commands are event-backed in events.jsonl, with control.json kept as a compatibility snapshot |
| Failure memory | Persistent ledger of past failures, ranked by recovery success |
| Phase gate | In collaborative mode, pauses between phase transitions |
| Parallel workers | Run experiments across multiple GPUs in isolated worktrees |

🤖 Supported Agents

| Agent | Command | Status |
| --- | --- | --- |
| Claude Code | --agent claude-code | Supported |
| Codex CLI | --agent codex | Supported |
| Aider | --agent aider | Supported |
| OpenCode | --agent opencode | Supported |
| Kimi CLI | --agent kimi-cli | Supported |
| Gemini CLI | --agent gemini-cli | Supported |

Auto-detection: If you don't specify --agent, PaperFarm uses the first installed agent it finds.

⚙️ Agent Configuration

Customize agent parameters in .research/config.yaml:

agents:
  claude-code:
    model: "claude-sonnet-4-5-20250514"   # override model
    allowed_tools: "Edit,Write,Bash,Read,Glob,Grep"
    extra_flags: ["--max-turns", "50"]
  codex:
    model: "gpt-5.2"                      # override default
    sandbox: "workspace-write"            # workspace-write | read-only | danger-full-access | full-auto
  aider:
    model: "gpt-4o"
    extra_flags: ["--no-git"]
  opencode:
    model: "openai/gpt-5"
    agent: "builder"
    extra_flags: ["--share"]
  kimi-cli:
    model: ""                       # optional model override
    agent: "okabe"                  # optional built-in agent profile
    agent_file: ""                  # custom agent file path (optional)
    extra_flags: ["--thinking"]
  gemini-cli:
    model: "gemini-3.1-pro"          # override default model
    sandbox: ""                       # optional sandbox mode
    extra_flags: []

📊 Interactive TUI Dashboard

The interactive TUI is a research command center built around the runtime state in .research/: frontier items, experiment results, worker status, and the event stream. Three tabs — Execution (frontier + workers), Metrics (summary stats + trend chart + results table), and Logs (color-coded event stream). Supports human-in-the-loop checkpoints — review hypotheses, override results, inject ideas, and edit goals without leaving the terminal.

Screenshots

Execution tab — frontier + parallel workers

Execution: frontier table sorted by priority with colored status, parallel workers running on multiple GPUs.

Metrics tab — experiment trend chart

Metrics: summary stats (kept/discarded/best/mean/latest), braille trend chart, and scrollable results table.

Logs tab — multi-round event stream

Logs: color-coded event stream with aligned prefixes — SKILL / DONE / W+ / W- / RES / WAIT / REVW / INJ / GOAL events across rounds.

Hypothesis review modal

Hypothesis Review: human-in-the-loop checkpoint — toggle, approve all, or reject frontier items before the next round.

Paused state

Paused: one-key pause/resume with bold indicator on the status bar.

Completed state

Completed: all phases checked off, final frontier state with best metric displayed.

🖼️ More Screenshots

Result review modal

Result Review: override AI keep/discard decisions and add constraints for the next round.

Inject experiment modal

Inject Experiment: add a human-authored idea to the frontier with priority.

Goal edit modal

Edit Goal: update research constraints and direction mid-run.

Stress test — 10 rounds, 6 workers

Stress Test: round 10, 8 frontier items, 6 parallel workers across GPUs — scales smoothly.

Idle initial state

Idle: clean initial state before any research round starts.

Failed state

Failed: bold red indicator when the research loop encounters an unrecoverable error.

📡 3 Tabs & Keyboard Shortcuts

  • Execution — Frontier table (sorted by priority, colored status) + Workers panel (GPU, frontier assignment, live status)
  • Metrics — Summary stats bar (kept/discarded/best/mean/latest + trend arrow) + braille trend chart + scrollable results table
  • Logs — Color-coded event stream: SKILL / DONE / OUT / W+ / W- / RES / WAIT / REVW / INJ / GOAL

Keyboard shortcuts: p pause, r resume, s skip, g edit goal, i inject idea, q quit.

🔎 Human-in-the-Loop Checkpoints

  • Hypothesis Review — After the manager proposes ideas, review frontier items: toggle keep/reject, approve all, or skip.
  • Result Review — After experiments complete, review AI decisions (keep/discard) and override any result.
  • Inject Idea (i key) — Add a human-authored experiment to the frontier at any time.
  • Edit Goal (g key) — Update research constraints and direction mid-run.
  • Pause/Resume (p/r keys) — Temporarily halt the research loop.

🚜 Installation

PaperFarm supports Linux, macOS, and Windows. Python 3.10+ is required.

Option A: pip install (recommended)

pip install PaperFarm

# Try the demo first (no agent or API key needed)
paperfarm demo                   # run in terminal
paperfarm demo --serve           # open in browser at http://localhost:8000

# Install browser support (optional)
pip install "PaperFarm[serve]"

# Then use it for real
cd your-project
paperfarm run

Option B: From source (for development)

🐧 Linux / 🍎 macOS / 💻 Windows
git clone https://github.com/shatianming5/PaperFarm.git
cd PaperFarm
make dev    # install with dev dependencies
make test   # run tests
make test-cov      # run tests with coverage gate (>=75%)
make lint   # run linter
make package-check # build wheel + install + CLI smoke test
make ci     # full local CI: lint + test + coverage + package smoke

🖥️ CLI Reference

All commands: paperfarm <command>

⚡ Core Commands

| Command | What It Does |
| --- | --- |
| run | Primary command: bootstrap if needed, otherwise run the existing workflow |
| run --mode headless --goal "..." --max-experiments N | Headless JSON Lines mode |
| run --workers N | Set experiment worker count for serial or parallel execution |
| init [--tag NAME] | Initialize .research/ directory |
| demo | Try the TUI with sample data (no agent needed) |
| demo --serve [--port N] | Serve the demo TUI in a browser (requires PaperFarm[serve]) |

Hidden compatibility alias: start still works for older scripts, but it is deprecated. Use run.

📈 Monitoring & Results

| Command | What It Does |
| --- | --- |
| status [--sparkline] | Show experiment progress |
| results [--chart primary] [--json] | Print results table or chart |
| logs [--follow] [--errors] | View agent logs |
| export | Export markdown report |
💡 Idea Management

| Command | What It Does |
| --- | --- |
| ideas list | Inspect the projected backlog currently derived from research_graph.json |
| ideas add "description" | Compatibility command; refuses mutation under research-v1 |
| ideas delete IDEA_ID | Compatibility command; refuses mutation under research-v1 |
| ideas prioritize | Compatibility command; refuses mutation under research-v1 |
🔧 Utilities & Diagnostics

| Command | What It Does |
| --- | --- |
| config show | View/validate configuration |
| doctor | Health-check the environment |

⚙️ Configuration

Edit .research/config.yaml:

🎛️ Full Configuration Reference
mode: autonomous              # autonomous | collaborative

experiment:
  timeout: 600                # seconds per experiment before kill
  max_consecutive_crashes: 3  # pause after N consecutive crashes
  max_experiments: 0          # 0 = unlimited; set to N to stop after N experiments
  max_parallel_workers: 0     # 0 = auto (one per GPU), 1 = serial
  worker_agent: ""            # agent for sub-workers (default: same as master)

metrics:
  primary:
    name: ""                  # filled by agent (e.g., "val_loss")
    direction: ""             # higher_is_better | lower_is_better

environment: |
  # Free-form notes for agents. Runtime execution uses bootstrap.* below.

bootstrap:
  auto_prepare: true          # run install/data/smoke before review/runtime
  working_dir: "."            # relative to repo root
  python: ""                  # explicit python path if needed
  install_command: ""         # explicit dependency install command
  data_command: ""            # explicit dataset/setup command
  smoke_command: ""           # explicit readiness check command
  expected_paths: []          # files/dirs that data/setup must materialize
  requires_gpu: false         # fail prepare if GPU is required but unavailable

research:
  protocol: research-v1
  manager_batch_size: 3
  critic_repro_policy: best_or_surprising

memory:
  ideation: true
  experiment: true
  repo_type_prior: true

roles:
  scout_agent: ""             # optional override
  manager_agent: ""           # optional override
  critic_agent: ""            # optional override
  experiment_agent: ""        # optional override

gpu:
  remote_hosts: []            # optional remote GPU allocation hosts

agents:                       # per-agent overrides (optional)
  claude-code:
    model: ""
    allowed_tools: "Edit,Write,Bash,Read,Glob,Grep"

🏡 Project Structure

🎯 Core System

| Module | Description |
| --- | --- |
| cli.py | CLI entry point, all commands (Typer) |
| run_cmd.py | Unified workflow entrypoint: bootstrap flow + existing-workflow runner |
| headless.py | Headless mode (JSON Lines output) |
| init_cmd.py | Initialize .research/ directory |
| config.py | Configuration parsing |
🤖 Agent Adapters (agents/)

| Module | Description |
| --- | --- |
| base.py | AgentAdapter abstract base class |
| claude_code.py | Claude Code adapter |
| codex.py | Codex CLI adapter |
| aider.py | Aider adapter |
| opencode.py | OpenCode adapter |
| kimi.py | Kimi CLI adapter |
| gemini.py | Gemini CLI adapter |
📊 TUI Components (tui/)

| Module | Description |
| --- | --- |
| app.py | Main Textual application for the research command center |
| widgets.py | Command, execution, logs, docs, lineage, frontier, and detail drawer widgets |
| view_model.py | TUI-specific aggregation layer from graph / memory / results / events into renderable state |
| review.py | Post-Scout review TUI |
| modals.py | Modal dialogs (AddIdea, GPUStatus, Log) |
| tui_runner.py | Shared Textual session lifecycle for bootstrap and existing-workflow entrypoints |
| styles.css | CSS styling |
⚙️ Runtime Engine

| Module | Description |
| --- | --- |
| idea_pool.py | Serial idea backlog plus parallel claim handling for workers |
| research_loop.py | Shared Scout → Manager → Critic → Experiment core loop |
| research_events.py | Typed event contract shared by TUI and headless |
| event_journal.py | Shared JSONL journal for runtime and control events |
| control_plane.py | Runtime control (pause/resume/skip) |
| failure_memory.py | Failure memory ledger (categorize, improve fixes) |
| worker.py | Parallel worker management (multi-GPU) |
| worktree.py | Git worktree management (worker isolation) |
| gpu_manager.py | GPU allocation (local/remote) |
| watchdog.py | Timeout watchdog (kill runaway experiments) |
| crash_counter.py | Crash counter (auto-pause after N failures) |
| phase_gate.py | Phase gate (collaborative mode confirmation) |
| activity.py | Activity monitor (real-time agent status) |

🌽 Examples

See examples/ for complete setups:

  • nanoGPT — Reduce validation loss in character-level language model training
  • Liger-Kernel — Optimize Triton GPU kernels
  • HF GLUE — Improve HuggingFace Transformers fine-tuning
  • CIFAR-10 Speedrun — Maximize CIFAR-10 image classification accuracy
  • YOLO Tiny — Optimize YOLOv8 object detection on COCO8
  • Whisper Fine-tune — Reduce Whisper speech recognition word error rate
  • CartPole RL — Maximize CartPole-v1 reinforcement learning reward
  • Code Perf — Optimize Python JSON parser throughput (non-ML)

🧑‍🌾 Contributing

Contributions are welcome! Please follow these steps:

  1. Open an issue to discuss the proposed change
  2. Fork the repository and create your feature branch
  3. Submit a pull request with a clear description

See CONTRIBUTING.md for guidelines and CHANGELOG.md for version history.

📄 License

This project is licensed under the MIT License.

