An LLM-driven ML engineering agent. Point it at a dataset and a goal — it writes the model, tests it, trains it, evaluates it, and iteratively improves the result via an actor/critic hill-climbing loop.
Give it as little as one sentence:
task_description: Build an image classifier for this dataset.
target_metric: { name: accuracy, target_value: 0.85 }

…and it'll explore architectures (CNN → ViT → classical CV + sklearn), augmentations, and hyperparameters until it clears your bar.
| Mode | Use when |
|---|---|
| Greenfield (default) | You have data but no code. Agent writes model.py, train.py, predict.py from scratch, then hill-climbs. |
| Brownfield (mode = "existing") | You have working code that needs improvement. Agent reads your existing baseline, then proposes/tests/keeps changes. |
Both modes share the same hill-climbing engine: propose → implement → test → train → evaluate, keep improvements via git, revert failures.
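As a rough illustration of the keep-or-revert idea, here is a minimal sketch in plain Python. It is not mle-beast's engine: `HillClimber`, `evaluate`, and the config dicts are hypothetical names, the sketch assumes a higher-is-better metric, and the real agent records keeps and reverts through git rather than an in-memory list.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class HillClimber:
    """Toy hill-climber: a candidate is kept only if it beats the best score."""

    evaluate: Callable[[dict], float]  # hypothetical: test + train + score a candidate
    best_config: Optional[dict] = None
    best_score: Optional[float] = None
    history: List[str] = field(default_factory=list)

    def step(self, candidate: dict) -> bool:
        """Evaluate one proposal; keep it on improvement, otherwise revert."""
        try:
            score = self.evaluate(candidate)
        except Exception as exc:  # stands in for a failed test or training run
            self.history.append(f"revert (error: {exc})")
            return False
        if self.best_score is None or score > self.best_score:
            self.best_config, self.best_score = candidate, score
            self.history.append(f"keep {score:.3f}")  # the real agent commits here
            return True
        self.history.append(f"revert {score:.3f}")  # the real agent rolls back here
        return False


def fake_evaluate(config: dict) -> float:
    """Made-up scoring function that peaks at lr = 0.01."""
    if config["lr"] <= 0:
        raise ValueError("bad lr")
    return 1.0 - abs(config["lr"] - 0.01) * 10


climber = HillClimber(evaluate=fake_evaluate)
for lr in (0.05, 0.02, 0.08, -1.0, 0.01):
    climber.step({"lr": lr})
print(climber.best_config, climber.history)
```

The first candidate plays the role of the baseline; later candidates survive only when they improve on the best score seen so far, and a crash counts as a revert rather than ending the run.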
curl -fsSL https://raw.githubusercontent.com/cgpadwick/mle-beast/main/setup-mle-beast.sh | bash

Interactive wizard. Asks whether you want Docker (recommended for first-timers, zero Python install) or Native (pipx install + BYO env). For Docker it generates a tailored docker-compose.yml + .env based on your answers and starts the container; for Native it runs pipx install mle-beast and hands off to mle-beast init.
Prefer to review before piping curl into bash? (Wise.)
curl -fsSL https://raw.githubusercontent.com/cgpadwick/mle-beast/main/setup-mle-beast.sh -o setup-mle-beast.sh
less setup-mle-beast.sh # read it
bash setup-mle-beast.sh # then run

Non-interactive (CI / automation):

bash setup-mle-beast.sh --docker --yes \
  --openrouter-key="$OPENROUTER_API_KEY"

Once the container's up, the dashboard opens at http://localhost:8000. Click New Run and fill in:
| Field | Value |
|---|---|
| Workspace path | leave default (autogenerated) |
| Task description | build an image classifier for MNIST |
| Target | 0.98 (easy first run) |
| Dataset path | leave blank |
| Mode | Greenfield |
| Metric name | accuracy |
| Metric direction | Higher = better |
| Existing environment | leave blank |
| Set up workspace | ✓ check |
| Force CPU | leave unchecked (uses GPU if one's visible to the container) |
Hit Start and watch the dashboard. The agent will write model.py / train.py / predict.py, run the pytest smoke tests, train, evaluate, and hill-climb until it clears 0.98 accuracy — usually 1–3 iterations on MNIST.
# 1. Grab the compose file + env template
curl -O https://raw.githubusercontent.com/cgpadwick/mle-beast/main/docker-compose.yml
curl -O https://raw.githubusercontent.com/cgpadwick/mle-beast/main/.env.example
# 2. Drop in your LLM provider key
cp .env.example .env
# edit .env, set OPENROUTER_API_KEY=sk-or-... (or OPENAI_API_KEY, etc.)
# 3. Up and away
docker compose up
# → dashboard at http://localhost:8000

The published image at ghcr.io/cgpadwick/mle-beast ships with ml-frameworks pre-cloned and the poetry wheel cache pre-primed, so the first greenfield run is fast (~30s of workspace setup instead of ~10 min of PyPI fetching). The default compose pulls :edge (continuous-delivery, follows every merge to main); switch to :0.1.0 / :latest once stable releases are tagged.

- Docker (or Docker Desktop on Mac/Windows). Linux: also install nvidia-container-toolkit if you want GPU access; the image falls back to CPU automatically if the GPU isn't visible.
- An LLM provider key: get one at openrouter.ai (recommended, one key for many models), or use OpenAI / a local OpenAI-compatible endpoint.
docker compose up creates two host directories next to your docker-compose.yml:
| Path | Holds |
|---|---|
| ./.mle-beast/ | SQLite DB + global settings. Run history survives docker compose down. |
| ./workspaces/ | Per-run workspace dirs (model.py, checkpoints, reports). cd in from your host to grab a model. |
The same image works with raw docker run if you prefer:
docker run --rm --gpus all -p 8000:8000 \
-e OPENROUTER_API_KEY=sk-or-... \
-v $(pwd)/.mle-beast:/home/mlebeast/.mle-beast \
-v $(pwd)/workspaces:/workspaces \
ghcr.io/cgpadwick/mle-beast:edge

Use this if you want to bring your own venv (brownfield mode) or iterate on the mle-beast source. The Docker bundle above is the easy "just try it" path; this is for when you want more control.
# Recommended — installs in an isolated venv, command goes on your PATH
pipx install mle-beast
# Or with the optional web dashboard
pipx install 'mle-beast[web]'
# Plain pip also works
pip install mle-beast

Don't have pipx? Run python3 -m pip install --user pipx && python3 -m pipx ensurepath, then open a new shell.
Working from a clone (e.g. contributing):
git clone https://github.com/cgpadwick/mle-beast.git
cd mle-beast
pip install -e '.[web]' # editable install; your changes are picked up live

mle-beast itself just needs Python 3.10+. For greenfield runs (where mle-beast builds a workspace venv for you) it also needs:
- git — clones the ml-frameworks stack into each workspace
- poetry — installs ml-frameworks's pinned dependency lock into that workspace venv
The mle-beast init step below diagnoses these for you and offers to install poetry via pipx if it's missing. Brownfield / BYO-environment runs skip both — you bring your own venv.
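For a sense of what such a diagnosis involves, here is a minimal standard-library sketch. It is not the code behind mle-beast init; `check_prereqs` is a hypothetical helper that only tests the Python version and whether git and poetry are on PATH.

```python
import shutil
import sys


def check_prereqs(min_python=(3, 10), tools=("git", "poetry")):
    """Return a {name: bool} report; True means the prerequisite is satisfied."""
    report = {"python": sys.version_info >= min_python}
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        report[tool] = shutil.which(tool) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_prereqs().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

A brownfield run would only care about the python entry, since git and poetry are needed just for building a greenfield workspace.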
After installing mle-beast, run mle-beast init in your project directory:
mkdir ~/my-mle-experiment && cd ~/my-mle-experiment
git init -q
mle-beast init

The init flow walks you through:

- Prereq check: verifies Python / poetry / git, offers to install poetry via pipx if it's missing.
- LLM provider: detects API keys already in your shell environment; if multiple are present, asks which to use. If none, prompts you to paste one in.
- Model picker: fetches the live catalog from your provider, prunes stale defaults, fuzzy-matches typos (gpt-4o-mini against OpenRouter → corrected to openai/gpt-4o-mini).
- Scaffolding: writes .env, ensures .env is gitignored, and drops an AGENTS.md so coding agents (Claude Code, Cursor, Aider) can drive mle-beast on your behalf.
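The model-picker correction can be approximated with the standard library. The sketch below is an assumption about how such matching might work, not the actual picker: `resolve_model` is a hypothetical function, the catalog is a hard-coded list rather than a live fetch, and the 0.8 similarity cutoff is arbitrary.

```python
import difflib
from typing import Optional, Sequence


def resolve_model(name: str, catalog: Sequence[str]) -> Optional[str]:
    """Map a user-typed model name onto an ID from the provider catalog."""
    if name in catalog:
        return name
    # Index catalog IDs by bare name: "openai/gpt-4o-mini" -> "gpt-4o-mini".
    by_bare = {model_id.split("/")[-1]: model_id for model_id in catalog}
    bare = name.split("/")[-1]
    if bare in by_bare:
        return by_bare[bare]
    # Fall back to fuzzy matching to absorb small typos.
    close = difflib.get_close_matches(bare, list(by_bare), n=1, cutoff=0.8)
    return by_bare[close[0]] if close else None


catalog = ["openai/gpt-4o-mini", "deepseek/deepseek-v4-flash"]
print(resolve_model("gpt-4o-mini", catalog))  # missing provider prefix
print(resolve_model("gpt-4o-mni", catalog))   # typo in the bare name
```

Trying exact and bare-name matches before the fuzzy step keeps the common case deterministic; the fuzzy fallback only runs when nothing matches literally.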
Then start the dashboard:
mle-beast # opens http://127.0.0.1:8000 in your browser

Or jump straight into the CLI REPL:

mle-beast --no-web

Or run a sample integration test end-to-end:

pytest tests/integration/test_shapes.py -m integration -v -s

You should see the agent discover the dataset, write a baseline classifier, train it, and iteratively improve it until accuracy clears 0.85.
mle-beast init --check # diagnose only; don't write any files
mle-beast init --yes # accept all defaults; no prompts (CI-friendly)
mle-beast init --no-validate-key # skip the live /models verification
mle-beast init --cwd PATH # scaffold into PATH instead of cwd

GitSetup → Baseline → [Propose → Implement → Test → Train → Evaluate] × N → Done
                       ↑__________________________________________________|
                                        hill-climb loop
- Actors run an inner tool loop: the LLM picks tools (read/write file, run shell command, launch training, etc.) via a discriminated-union Pydantic model, so a call to a tool that doesn't exist fails validation instead of being executed.
- Critics are procedural: they run pytest, parse logs, check git state — and call the LLM once for feedback text. Critics don't use tools.
- Convergence: max_steps (default 30) or max_consecutive_failures (default 10), whichever fires first. target_metric lets the run exit early when the bar is cleared.
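The stopping rules can be sketched as a small function over a stream of per-step scores. This is an illustration under assumptions, not the project's code: `run_until_converged` is a hypothetical name, the metric is taken to be higher-is-better, and a "failure" is assumed to mean a step that either errored (score `None`) or did not improve on the best score.

```python
from typing import Iterable, Optional


def run_until_converged(
    scores: Iterable[Optional[float]],
    target_value: Optional[float] = None,
    max_steps: int = 30,
    max_consecutive_failures: int = 10,
) -> str:
    """Consume per-step scores (None = failed step) and report why the run stops."""
    best = None
    failures = 0
    for step, score in enumerate(scores, start=1):
        if score is None or (best is not None and score <= best):
            failures += 1  # failed or non-improving step
        else:
            best, failures = score, 0  # improvement resets the failure streak
        if target_value is not None and best is not None and best >= target_value:
            return f"target reached at step {step}"
        if failures >= max_consecutive_failures:
            return f"stopped after {failures} consecutive failures"
        if step >= max_steps:
            return f"stopped at max_steps={max_steps}"
    return "ran out of proposals"


print(run_until_converged([0.5, 0.7, 0.9], target_value=0.85))
```

Checking the target before the two limits means a run that clears the bar on its final allowed step still reports success.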
mle-beast init walks you through this interactively, but the underlying knobs (which init writes to .env for you) are:
# Pick a model. init writes this based on the live provider catalog.
export MLE_BEAST_MODEL=deepseek/deepseek-v4-flash
# Pin the provider explicitly when multiple keys are present.
# Without this pin, the resolution order is:
# LOCAL_LLM_BASE_URL > OPENROUTER_API_KEY > OPENAI_API_KEY
export MLE_BEAST_PROVIDER=openrouter

Shell environment variables always win over .env; .env only fills in gaps. So if you're testing a one-off model swap, just running MLE_BEAST_MODEL=other/model mle-beast overrides what's in .env for that invocation.
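The two precedence rules (shell over .env, and the provider resolution order) can be pictured with plain dictionaries. This is a sketch under assumptions: `merge_env` and `resolve_provider` are hypothetical helpers, and the provider names "local" and "openai" are guesses, since only "openrouter" appears above.

```python
from typing import Mapping, Optional


def merge_env(shell_env: Mapping[str, str], dotenv: Mapping[str, str]) -> dict:
    """Shell variables win; .env values only fill in keys the shell lacks."""
    merged = dict(dotenv)
    merged.update(shell_env)
    return merged


# Resolution order from the text; "local" and "openai" are assumed names.
PROVIDER_ORDER = (
    ("LOCAL_LLM_BASE_URL", "local"),
    ("OPENROUTER_API_KEY", "openrouter"),
    ("OPENAI_API_KEY", "openai"),
)


def resolve_provider(env: Mapping[str, str]) -> Optional[str]:
    """An explicit MLE_BEAST_PROVIDER pin beats key-based detection."""
    if env.get("MLE_BEAST_PROVIDER"):
        return env["MLE_BEAST_PROVIDER"]
    for key, provider in PROVIDER_ORDER:
        if env.get(key):
            return provider
    return None


dotenv = {"MLE_BEAST_MODEL": "deepseek/deepseek-v4-flash", "OPENROUTER_API_KEY": "sk-or-test"}
env = merge_env({"MLE_BEAST_MODEL": "other/model"}, dotenv)
print(env["MLE_BEAST_MODEL"], resolve_provider(env))
```

The one-off override in the shell replaces the model for that invocation, while the API key still comes from .env because the shell did not set it.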
Recommended models (good cost/quality for hill-climbing):
| Provider | Model | Notes |
|---|---|---|
| OpenRouter | deepseek/deepseek-v4-flash | Cheap, surprisingly capable on ML tasks |
| OpenAI | gpt-5-mini | Solid, more expensive |
| Local | Qwen/Qwen3-Coder-30B-A3B-Instruct | Strong open-weight code model |
Every shell command and Python file the agent runs is pre-screened against a small policy at src/mle_beast/safety/policy.json. Blocked commands return a recoverable error string (ERROR: blocked by safety policy: <reason>) that the agent sees in its tool-result loop and self-corrects from — the run continues, the dangerous command never executes.
Blocked by default:
- Privilege escalation tokens: sudo, su, doas, pkexec. Word-boundary matched, so pseudo-random is fine.
- Catastrophic patterns: rm -rf /, rm -rf ~, mkfs, dd if=...of=/dev/, curl ... | bash, fork bombs, shutdown/reboot.
- Workspace escapes: any destructive verb (rm, mv, chmod, chown, tee, shell redirects) with an absolute or ..-relative path that resolves outside the workspace.
- Remote git mutations: git push (any form). Local git operations (commit, status, log, checkout, diff, add) are always allowed because the hill-climb pipeline relies on them.
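To make the token and pattern checks concrete, here is a minimal sketch of string-based screening. It is not the shipped policy: the regexes are illustrative guesses covering only a few of the patterns listed, `screen_command` is a hypothetical function, and the "pattern:" reason prefix is assumed (only "token:" appears in the log sample further down). Workspace-escape checks are left out.

```python
import re
from typing import Optional

BLOCKED_TOKENS = ("sudo", "su", "doas", "pkexec")
# Illustrative regexes only; the real policy.json may phrase these differently.
BLOCKED_PATTERNS = (
    r"\brm\s+-rf\s+(/|~)(\s|$)",
    r"\bmkfs\b",
    r"\bcurl\b.*\|\s*(ba)?sh\b",
    r"\bgit\s+push\b",
)


def screen_command(command: str) -> Optional[str]:
    """Return a recoverable error string if the command is blocked, else None."""
    for token in BLOCKED_TOKENS:
        # \b makes this a whole-word match, so "pseudo-random" is not flagged.
        if re.search(rf"\b{re.escape(token)}\b", command):
            return f"ERROR: blocked by safety policy: token:{token}"
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, command, flags=re.IGNORECASE):
            return f"ERROR: blocked by safety policy: pattern:{pattern}"
    return None


for cmd in ("sudo apt install foo", "git status", "echo pseudo-random"):
    print(cmd, "->", screen_command(cmd))
```

Returning a string instead of raising lets the result flow back to the agent as an ordinary tool result. A check like this is purely textual, so an obfuscated command such as `$(echo s)udo apt install foo` passes it.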
Deliberately NOT blocked:
- Read-only commands outside the workspace (cat /etc/os-release, ls /usr/lib): the agent legitimately needs to inspect system files sometimes.
- Workspace-internal writes (rm -rf checkpoints/old, mv train.py train.py.bak): workspace cleanup is normal hill-climb behavior.
- Writes to allow-listed paths outside the workspace: /tmp, /var/tmp, ~/.cache, ~/.config/pypoetry, ~/.mle-beast (these hold poetry caches, dataset downloads, mle-beast's own DB).
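One way to picture the "resolves outside the workspace" test is to normalize the target path and compare it against the workspace root and the allow-list. The sketch below is an assumption about the approach, not the real implementation; `write_allowed` is a hypothetical helper and the /workspaces/run1 path is made up.

```python
from pathlib import Path
from typing import Iterable

# Allow-listed locations taken from the list above.
ALLOWED_OUTSIDE = ("/tmp", "/var/tmp", "~/.cache", "~/.config/pypoetry", "~/.mle-beast")


def write_allowed(target: str, workspace: str, allowed: Iterable[str] = ALLOWED_OUTSIDE) -> bool:
    """True if a destructive command may touch `target` when run from `workspace`."""
    root = Path(workspace).expanduser().resolve()
    path = Path(target).expanduser()
    if not path.is_absolute():
        path = root / path  # relative paths are interpreted inside the workspace
    path = path.resolve()  # collapses ".." segments and follows symlinks
    roots = [root] + [Path(p).expanduser().resolve() for p in allowed]
    return any(path == r or r in path.parents for r in roots)


ws = "/workspaces/run1"
for target in ("checkpoints/old", "../other/model.py", "/etc/passwd", "/tmp/cache/x"):
    print(target, "->", write_allowed(target, ws))
```

Resolving before comparing is what catches `..`-relative escapes: `../other/model.py` looks harmless as a string but lands outside the workspace root once normalized.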
Open src/mle_beast/safety/policy.json and:
- Add a whole-word token (e.g. another privilege-escalation tool) to blocked_tokens.
- Add a Python regex (case-insensitive, matched against the full command line) to blocked_patterns.
- Add a path you DO want writable outside the workspace to allow_paths_outside_workspace.
Restart any running mle-beast process to pick up the change — the policy is loaded once per process and cached.
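The load-once behaviour is why a restart is needed. A minimal sketch of that caching pattern, assuming nothing about the real loader beyond what is stated above (`load_policy` is a hypothetical name):

```python
import json
from functools import lru_cache


@lru_cache(maxsize=None)
def load_policy(path: str) -> dict:
    """Read the policy file once per process; later calls return the cached dict."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

With a cache like this, editing the JSON file while the process is running has no effect: every later `load_policy(path)` call returns the dict parsed on first use until the process exits (or, in this sketch, until `load_policy.cache_clear()` is called).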
Every block is appended to <workspace>/logs/safety.log as a tab-separated line:
2026-05-21T11:24:33 shell token:sudo sudo apt install foo
Useful when a run takes a recovery path you didn't expect — grep the file to see exactly what was blocked and why.
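If grep is not enough, the log is easy to parse programmatically. The sketch below assumes each line has exactly four tab-separated fields in the order shown in the sample; the field names `timestamp`, `source`, `reason`, and `command` are labels chosen here, not names taken from the project.

```python
from typing import NamedTuple


class SafetyLogEntry(NamedTuple):
    timestamp: str
    source: str   # e.g. "shell"
    reason: str   # e.g. "token:sudo"
    command: str  # the blocked command line


def parse_safety_line(line: str) -> SafetyLogEntry:
    """Split one tab-separated safety.log line into its four fields."""
    # maxsplit=3 keeps any tabs inside the command itself intact.
    timestamp, source, reason, command = line.rstrip("\n").split("\t", 3)
    return SafetyLogEntry(timestamp, source, reason, command)


entry = parse_safety_line("2026-05-21T11:24:33\tshell\ttoken:sudo\tsudo apt install foo\n")
print(entry.reason, "|", entry.command)
```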
These guardrails catch the confused-LLM failure mode — an agent that hallucinates sudo apt install ... or accidentally points rm -rf at the wrong directory. They are NOT a sandbox: a deliberately adversarial model could bypass string-based checks (e.g. $(echo s)udo, base64 -d | sh). For threat models that include hostile prompts, container-based isolation is the correct answer; mle-beast doesn't ship that today.
Sample projects in tests/integration/*/project.yaml show the full schema. The minimum:
goals:
  task_description: Build an image classifier for this dataset.
  target_metric:
    name: accuracy
    target_value: 0.85
    comparison: ">="

Tip from running the suite: terse task descriptions outperform prescriptive ones. "Build an image classifier for this dataset" lets the agent explore architecture freely; "Build a CNN with Conv2d layers and data augmentation" anchors the search inside CNN-land and often fails to escape on small datasets. Be specific only when domain constraints actually matter.
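The target_metric fields in the minimal config map naturally onto a comparison function. A small sketch of how such a check might look: `target_met` is a hypothetical helper, only ">=" appears in the sample config, and the other operators and the ">=" default are assumptions.

```python
import operator

# Only ">=" appears in the sample config; the other operators are assumed.
COMPARISONS = {">=": operator.ge, ">": operator.gt, "<=": operator.le, "<": operator.lt}


def target_met(score: float, target_metric: dict) -> bool:
    """Check a score against a target_metric mapping shaped like the sample config."""
    compare = COMPARISONS[target_metric.get("comparison", ">=")]
    return compare(score, target_metric["target_value"])


target = {"name": "accuracy", "target_value": 0.85, "comparison": ">="}
print(target_met(0.90, target), target_met(0.80, target))
```

An explicit comparison field is what would let the same check serve lower-is-better metrics such as a loss, where the target is an upper bound rather than a floor.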
The dashboard is the default when you run mle-beast — it'll open in your browser automatically.
mle-beast # starts dashboard + opens browser
mle-beast --no-browser # starts dashboard, you open the URL yourself
mle-beast --no-browser --port 9000 # custom port
mle-beast --no-web # drop into the CLI REPL instead

The React frontend shows live run state, hill-climb experiments, per-experiment scores + commit SHAs, token usage, a DAG view of the pipeline, and a "Show Report" button that pops a self-contained HTML report (saved to <workspace>/reports/ for offline sharing or PDF export).
# Unit tests (no API key needed)
pytest tests/ -k "not integration"
# A single integration test
pytest tests/integration/test_churn_quick.py -m integration -v -s

CI runs unit tests on Python 3.10 / 3.11 / 3.12 against every PR.
PRs welcome. See CONTRIBUTING.md — contributions use the Developer Certificate of Origin (just add -s to your commits).
Apache License 2.0. Third-party dependency attributions in THIRD_PARTY_LICENSES.md.
