Standalone Python SDK for autonomous experiment loops.
Try ideas, keep what works, discard what doesn't, never stop.
Inspired by karpathy/autoresearch and the OpenClaw AutoResearch plugin.
- 🔄 Experiment Loop Engine — init → run → log → improve → repeat
- 📊 Statistical Confidence — MAD-based noise floor to distinguish real improvements from noise
- 📝 Ideas Backlog — Failed experiments automatically generate ideas for future exploration
- 📈 State Persistence — JSONL results log, survives restarts
- 🛠️ CLI — Full command-line interface with rich output
- 🐍 Pure Python — No dependencies beyond rich and click
- 📦 OpenClaw Compatible — Same JSONL format as the OpenClaw AutoResearch plugin
pip install autoresearch

# Initialize experiment
autoresearch init --name "optimize-latency" --metric latency_ms --direction lower --unit ms
# Run baseline
autoresearch run -c "./benchmark.sh"
# Log result
autoresearch log -d "Baseline measurement"
# Make changes, run again
autoresearch run -c "./benchmark.sh"
autoresearch log -d "Added caching layer"
# Discard bad idea
autoresearch run -c "./benchmark.sh"
autoresearch log -s discard -d "Thread pool didn't help" -i "Try async I/O instead"
# Check status
autoresearch status

from autoresearch import AutoResearch
# Initialize
ar = AutoResearch(
    name="optimize-latency",
    metric="latency_ms",
    direction="lower",  # lower is better
    unit="ms",
)
ar.init()
# Run baseline
result = ar.run("./benchmark.sh")
print(f"Passed: {result.passed}, Duration: {result.duration_seconds:.1f}s")
ar.log("Baseline measurement")
# Run experiment
result = ar.run("./benchmark.sh")
if result.passed:
    ar.log("Added Redis caching")
else:
    ar.log("Crashed", status="crash")
# Discard with idea
result = ar.run("./benchmark.sh")
ar.discard_with_idea("Thread pool overhead", "Try event loop instead")
# Check confidence
print(f"Confidence: {ar.confidence}")
print(f"Best: {ar.best_metric} vs Baseline: {ar.baseline_metric}")
# Export results
import json
print(json.dumps(ar.get_status(), indent=2))

SCAN → HYPOTHESIS → EXPERIMENT → MEASURE → INTEGRATE → repeat
- SCAN — Identify optimization opportunities
- HYPOTHESIS — Form a testable hypothesis
- EXPERIMENT — Make the change and run the benchmark
- MEASURE — Parse metrics, compare to baseline
- INTEGRATE — Keep if better, discard if worse, log idea for future
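The INTEGRATE step above can be sketched in a few lines. This is a minimal illustration, not the SDK's engine: `LoopState` and `integrate` are hypothetical names, the state lives in memory rather than in the JSONL log, and the decision ignores noise (any strict improvement is kept).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LoopState:
    """Hypothetical in-memory state; the SDK persists to JSONL instead."""
    direction: str                      # "lower" or "higher" is better
    best: Optional[float] = None        # best metric kept so far
    ideas: List[str] = field(default_factory=list)


def integrate(state: LoopState, metric: float, idea: Optional[str] = None) -> str:
    """Return "keep" if the metric beats the current best, else "discard"."""
    if state.best is None:
        state.best = metric             # the first run becomes the baseline
        return "keep"
    if state.direction == "lower":
        better = metric < state.best
    else:
        better = metric > state.best
    if better:
        state.best = metric
        return "keep"
    if idea:
        state.ideas.append(idea)        # remember a follow-up for a later run
    return "discard"


state = LoopState(direction="lower")
print(integrate(state, 142.5))                            # keep (baseline)
print(integrate(state, 120.0))                            # keep
print(integrate(state, 131.0, "Try async I/O instead"))   # discard
print(state.ideas)                                        # ['Try async I/O instead']
```

A real loop would also consult the confidence score before keeping a result, so that a change smaller than the noise floor is re-run rather than accepted.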
AutoResearch uses Median Absolute Deviation (MAD) as the noise floor:
| Confidence | Meaning |
|---|---|
| ≥ 2.0x | Improvement is likely real ✅ |
| 1.0-2.0x | Above noise but marginal |
| < 1.0x | Within noise, re-run to confirm ❓ |
This prevents false positives from natural variance.
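To make the table concrete, here is a rough sketch of a MAD-based score. It assumes confidence is the absolute improvement over the baseline divided by the MAD of all recorded metric values; the exact formula used by the SDK is not stated here, and the function names `mad` and `confidence` are illustrative.

```python
from statistics import median


def mad(values):
    """Median absolute deviation: the median of |x - median(values)|."""
    m = median(values)
    return median([abs(x - m) for x in values])


def confidence(baseline, best, samples):
    """Improvement expressed as a multiple of the noise floor.

    Assumed definition: |baseline - best| / MAD(samples).
    Returns None when there are too few samples or no measurable noise.
    """
    if len(samples) < 3:
        return None
    noise = mad(samples)
    if noise == 0:
        return None
    return abs(baseline - best) / noise


samples = [142.5, 140.0, 145.0, 141.0, 120.0]
print(mad(samples))                        # 1.5
print(confidence(142.5, 120.0, samples))   # 15.0
```

In this example the 22.5 ms improvement is 15 times the 1.5 ms noise floor, which falls in the "likely real" row of the table.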
Your benchmark script should output METRIC name=value lines:
#!/bin/bash
set -euo pipefail
# Run your benchmark
result=$(./my_benchmark)
# Output metrics
echo "METRIC latency_ms=$result"
echo "METRIC throughput_qps=$(echo '10000 / ' $result | bc)"

All results are stored in autoresearch.results.jsonl:
{"type": "config", "name": "optimize-latency", "metricName": "latency_ms", "metricUnit": "ms", "bestDirection": "lower"}
{"run": 1, "commit": "abc1234", "metric": 142.5, "metrics": {"latency_ms": 142.5}, "status": "keep", "baseline": true, "description": "Baseline", "timestamp": 1713888000, "segment": 0, "confidence": null}

Discarded experiments append to autoresearch.ideas.md:
- Try async I/O instead of thread pool
- Consider connection pooling for database
- Pre-compute lookup table for hot path

autoresearch/
├── src/autoresearch/
│ ├── __init__.py # Public API
│ ├── core/
│ │ ├── engine.py # Experiment loop engine
│ │ ├── confidence.py # Statistical confidence (MAD)
│ │ └── metrics.py # METRIC name=value parser
│ ├── cli.py # Rich CLI interface
│ ├── utils/
│ └── metrics/
├── tests/
│ └── test_autoresearch.py
├── pyproject.toml
└── README.md
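The tree lists a METRIC name=value parser under core/metrics.py. The snippet below is a guess at what such a parser might look like, not the module's actual code: the regular expression, the accepted number formats, and the name `parse_metrics` are all assumptions.

```python
import re

# Assumed line shape: "METRIC <name>=<number>"; the real parser may accept more.
METRIC_RE = re.compile(
    r"^METRIC\s+([A-Za-z_][\w.]*)=(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)$"
)


def parse_metrics(output: str) -> dict:
    """Collect METRIC lines from benchmark output, ignoring everything else."""
    metrics = {}
    for line in output.splitlines():
        match = METRIC_RE.match(line.strip())
        if match:
            metrics[match.group(1)] = float(match.group(2))
    return metrics


sample = "warming up...\nMETRIC latency_ms=142.5\nMETRIC throughput_qps=70\n"
print(parse_metrics(sample))  # {'latency_ms': 142.5, 'throughput_qps': 70.0}
```

Skipping lines that do not match lets a benchmark script print progress messages freely without confusing the parser.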
| Feature | OpenClaw Plugin | Python SDK |
|---|---|---|
| Experiment loop | ✅ (via tools) | ✅ (API + CLI) |
| Confidence scoring | ✅ (MAD) | ✅ (MAD, same algorithm) |
| Ideas backlog | ✅ | ✅ |
| State persistence | ✅ (JSONL) | ✅ (JSONL, compatible) |
| Git integration | ✅ (auto checkout) | |
| Session management | ✅ (OpenClaw sessions) | ❌ (standalone) |
| Rich output | ✅ | ✅ (click + rich) |
| OpenClaw required | ✅ | ❌ |
| Python importable | ❌ | ✅ |
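Because the log is plain JSONL, it can also be inspected without either tool. The sketch below uses only the standard library and the field names visible in the sample records (`type`, `metric`, `status`, `bestDirection`, `metricName`); `load_results` and `best_kept` are hypothetical helpers, not part of the SDK.

```python
import json
from pathlib import Path


def load_results(path):
    """Split a results log into (config, runs) using the "type" field."""
    config, runs = None, []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        if record.get("type") == "config":
            config = record
        else:
            runs.append(record)
    return config, runs


def best_kept(config, runs):
    """Best metric among runs with status "keep", or None if there are none."""
    kept = [r["metric"] for r in runs if r.get("status") == "keep"]
    if not kept:
        return None
    return min(kept) if config["bestDirection"] == "lower" else max(kept)


log = Path("autoresearch.results.jsonl")
if log.exists():
    config, runs = load_results(log)
    if config is not None:
        print(f"{len(runs)} runs, best {config['metricName']}: {best_kept(config, runs)}")
```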
License: MIT. Use it however you want.
Author: Jose Manuel Sabarís García (@llllJokerllll)
- karpathy/autoresearch — Original concept
- OpenClaw AutoResearch Plugin — Confidence algorithm and JSONL format