Run several LLM agents (claude, codex, gemini) on the same
task in parallel — each in its own docker container with its own
subscription auth — then have other LLM agents read the
outputs blind (under A / B / C labels), score them against
your rubric, and write reviews.
You get a leaderboard.md plus a corpus of N divergent solutions
to one brief. No API bills: it goes through your Claude Pro
/ ChatGPT Plus / Gemini Advanced subscriptions.
Why "multicooker": one task, several dishes cooking in parallel, each in its own pot; you compare what came out of each.
🇷🇺 Russian version:
README.ru.md.
When a task is underspecified (design, copy, refactoring with an architectural choice, code review), there is no single "correct" answer. Every model fills the gaps in the brief on its own, and what it fills in is the interesting part. A single run through a single model doesn't show this; you only see one interpretation and assume it's "the answer".
multicooker gives you a corpus of divergent interpretations of the same brief in one shot. Useful when:
- You're picking between models for a recurring task (refactoring, design, doc writing, code review) and are tired of deciding by vibes.
- You want to see where a brief is underspecified — disagreement between models highlights exactly those spots.
- You're doing design or copy work and want three takes from three different "heads" instead of one.
- You're studying how much models agree with each other on open tasks (often: not much).
                    ┌─────────────────────────────┐
                    │  cooks/260516-task/         │
                    │  BRIEF.md   JUDGE_BRIEF.md  │
                    │  brief.yaml raw/            │
                    └──────────────┬──────────────┘
                                   │ multicooker cook
       ┌───────────────────────────┼───────────────────────────┐
       ▼                           ▼                           ▼
┌─────────────┐             ┌─────────────┐             ┌─────────────┐
│   claude    │             │    codex    │             │   gemini    │
│  container  │ (parallel)  │  container  │ (parallel)  │  container  │
│    net-A    │             │    net-B    │             │    net-C    │
│  /work/...  │             │  /work/...  │             │  /work/...  │
└──────┬──────┘             └──────┬──────┘             └──────┬──────┘
       │ out/                      │ out/                      │ out/
       └───────────────────────────┼───────────────────────────┘
                                   ▼
                      ┌─────────────────────────┐
                      │  anonymize → A/B/C      │
                      │  mapping stays on host  │
                      └────────────┬────────────┘
                                   │ multicooker judge
             ┌─────────────────────┼─────────────────────┐
             ▼                                           ▼
    ┌─────────────────┐                         ┌─────────────────┐
    │     judge-1     │                         │     judge-2     │
    │ (claude/codex/  │ scores everyone except  │   (different    │
    │     gemini)     │     its own flavor      │     flavor)     │
    └────────┬────────┘                         └────────┬────────┘
             │          scores.json + review.md          │
             └─────────────────────┬─────────────────────┘
                                   ▼ multicooker report
                         ┌───────────────────┐
                         │  leaderboard.md   │
                         └───────────────────┘
The key properties:
- Isolation. Each participant runs in its own container on its own bridge network, so it can't see the other participants, the judge brief, or the A↔flavor mapping.
- Parallelism. All participants start at the same time. One being rate-limited doesn't block the others.
- Anonymization. Judges only see A/B/C with no model names. The mapping lives only on the host.
- Anti-self-judge. A judge never scores submissions from its own flavor: claude doesn't judge claude's output.
- No API keys. Subscription credentials (Claude Pro / ChatGPT Plus / Gemini Advanced) are passed into containers via bind-mount or named volume, read-only. See docs/auth.md.
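The anonymization and anti-self-judge properties can be illustrated with a minimal sketch. This is not the project's actual code: the function names assign_labels and labels_for_judge, the seed parameter, and the shape of the flavors dict are all assumptions made for illustration.

```python
import random
import string


def assign_labels(participants, seed=None):
    """Shuffle participant names and map them to labels A, B, C, ...

    Returns {label: participant_name}. In this sketch the returned mapping
    is the part that would stay on the host; judges only get the labels.
    """
    names = list(participants)
    if len(names) > len(string.ascii_uppercase):
        raise ValueError("this sketch only supports up to 26 participants")
    random.Random(seed).shuffle(names)
    return dict(zip(string.ascii_uppercase, names))


def labels_for_judge(judge_flavor, mapping, flavors):
    """Return the labels a judge may score: everything except its own flavor."""
    return sorted(
        label for label, name in mapping.items() if flavors[name] != judge_flavor
    )


# Hypothetical cook: participant name -> flavor.
flavors = {"claude": "claude", "codex": "codex", "gemini": "gemini"}
mapping = assign_labels(flavors, seed=7)

for judge in ("claude", "codex"):
    print(judge, "scores", labels_for_judge(judge, mapping, flavors))
```

With three participants of three different flavors, each judge ends up scoring exactly two labels, and the shuffle keeps the label order from leaking which model produced which output.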
git clone https://github.com/faeton/multicooker
cd multicooker
pip install -e .

Requirements:
- macOS or Linux host with a running docker daemon. On macOS, OrbStack is the recommended runtime — noticeably faster startup, lower idle CPU, and friendlier resource handling than Docker Desktop. Docker Desktop and colima also work.
- Python 3.10+.
- At least one of these CLIs installed and logged in: claude (claude /login), codex (run codex to log in), gemini (run gemini to log in). You only need the flavors you actually want to run.
Want to try the pipeline without subscription creds? There's a
dummy flavor — see examples/hello-task.
The fastest way to use multicooker is to fire up an LLM agent
inside the repo and let it scaffold and run the cook for you.
The repo ships with a CLAUDE.md (and an AGENTS.md symlink for
codex / gemini) that already explains the project, the shape of a
cook, and the rule that the rubric stays in sync between
brief.yaml and JUDGE_BRIEF.md. Any agent reading it can do the
boring part for you.
git clone https://github.com/faeton/multicooker && cd multicooker
pip install -e .
claude   # or: codex, or: gemini; they all read AGENTS.md

Then describe what you want in plain language:
"Set up a cook called landing-redesign. Compare claude / codex / gemini on a single-file HTML hero for [product]. Judge on visual-hierarchy, typography, color-discipline, content-fit, polish. References are at ~/work/brand/notes.md and ~/work/brand/voice.md. Then run cook + judge + report."
The agent reads CLAUDE.md and examples/design-landing/ as
templates, drafts your BRIEF.md / JUDGE_BRIEF.md / brief.yaml,
copies the refs into raw/, kicks off multicooker cook, waits
for it to finish, then runs judge and report. You read the
leaderboard.
Iterating is the same conversation:
"Feedback for everyone: too much whitespace, push for denser layout. Specifically for
claude: keep the color palette but tighten the type scale. Refine."
Or — start a new cook reusing the same reference material (different task, same brand assets):
"Same refs as the previous cook. New brief: a 3-frame onboarding sequence instead of a single landing. Judge the same dimensions plus story-clarity. Run it."
This is the canonical workflow. The manual flow below is useful for understanding the moving parts, but it's not how you'd typically use the tool day-to-day.
# 1. Preflight — docker, compose, creds for each flavor
multicooker doctor
# 2. Scaffold (name is auto-prefixed with today's date → 260509-my-task)
multicooker new my-task
# 3. Describe the task
cd cooks/260509-my-task
$EDITOR BRIEF.md # what participants must do
$EDITOR JUDGE_BRIEF.md # how judges will score
$EDITOR brief.yaml # participants, judges, timeout, rubric
cp ~/some-reference.* raw/ # reference materials (mounted RO)
# 4. Cook — all participants in parallel, each in its own container
multicooker cook 260509-my-task
# 5. Judge — blind: judges only see A/B/C labels
multicooker judge 260509-my-task
# 6. Summary → leaderboard.md
multicooker report 260509-my-task
cat cooks/260509-my-task/leaderboard.md

Two ready-to-run examples in the repo, copy and go:
- examples/hello-task: sanitized smoke test on the dummy flavor, no LLM creds required. ~10 seconds from start to leaderboard. Run it once to see the shape of a cook on the simplest possible task.
- examples/design-landing: a real design task in which each model designs its own landing page for multicooker. You get three HTML files to compare side by side in a browser. More on this below.
The most illustrative use case is tasks where there's no right answer but there are quality criteria. Design, copy, naming, architectural essays. Here models diverge not because one is buggy but because they hold different "aesthetic beliefs", and comparison becomes substantive.
examples/design-landing is a working template for this kind of
cook. Brief: "design a landing page for multicooker, single-file
HTML, no build step". When you open the three index.html files
side by side, you typically see:
- Palette. One model commits to strict monochrome; another scatters six accent colors and doesn't quite know what to do with them; another defaults to dark mode.
- Typography. Someone reaches for the system stack; someone pulls Inter from Google Fonts; someone leaves the default serif, and the hero blocks read completely differently as a result.
- Density. One packs features into a three-column grid with small text; another goes for one big half-screen block.
- Content fit. Someone quotes raw/product.md verbatim; someone reimagines the product according to their own theories of what a "proper landing" should be (the content-fit dimension in the rubric exists to catch this).
- Polish. Hover states, spacing rhythm, code-block styling, footer treatment: small decisions that separate "draft" from "shipped".
The rubric in examples/design-landing/JUDGE_BRIEF.md
scores on visual-hierarchy / typography / color-discipline / content-fit / polish. Two judges of different flavors score
blindly — and they often disagree with each other. That's a useful
signal: on design tasks, judge disagreement means there's no
"winner on points", just three different directions, and you pick
with your eyes.
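To make "judges disagree" concrete, here is a rough sketch of rolling per-judge scores into a ranking with a disagreement column. The score layout (judge → label → dimension → score) and the function names are assumptions for illustration; the real scores.json format is not documented here and may differ.

```python
from statistics import mean

# Hypothetical scores, shaped as judge -> label -> dimension -> score (1-10).
scores = {
    "judge-1": {
        "A": {"typography": 8, "polish": 6},
        "B": {"typography": 5, "polish": 7},
    },
    "judge-2": {
        "A": {"typography": 4, "polish": 6},
        "B": {"typography": 6, "polish": 7},
    },
}


def totals_by_judge(scores):
    """Average each judge's dimension scores into one number per label."""
    return {
        judge: {label: mean(dims.values()) for label, dims in by_label.items()}
        for judge, by_label in scores.items()
    }


def leaderboard(scores):
    """Rank labels by mean score across judges, with the judge spread."""
    per_judge = totals_by_judge(scores)
    labels = sorted({label for by_label in per_judge.values() for label in by_label})
    rows = []
    for label in labels:
        # A judge skips its own flavor, so not every judge scores every label.
        values = [by_label[label] for by_label in per_judge.values() if label in by_label]
        rows.append((label, mean(values), max(values) - min(values)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


for label, avg, spread in leaderboard(scores):
    print(f"{label}: mean={avg:.2f} spread={spread:.2f}")
```

In this made-up data, B edges out A on the mean, but the two judges are two full points apart on A. A large spread like that is the cue to open the files and look rather than trust the ranking.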
# Run the design example (requires claude/codex/gemini logins)
multicooker new landing --participants claude,codex,gemini
TASK=$(basename "$(ls -d cooks/*-landing | tail -1)")
cp examples/design-landing/{BRIEF.md,JUDGE_BRIEF.md,brief.yaml} cooks/$TASK/
cp examples/design-landing/raw/* cooks/$TASK/raw/
multicooker cook $TASK
multicooker judge $TASK
multicooker report $TASK
# Open all three variants side by side, plus the leaderboard
open cooks/$TASK/out/*/index.html
cat cooks/$TASK/leaderboard.md

This template adapts to any design task: SVG logo, README header,
email template, dashboard mockup. You only need to rewrite
BRIEF.md for your output and tweak the rubric dimensions
(brand-fit, accessibility, density, motion-restraint —
anything, as long as the names match between brief.yaml and
JUDGE_BRIEF.md). See
examples/design-landing/README.md
for the full adaptation guide.
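Since the rubric names have to match between brief.yaml and JUDGE_BRIEF.md, a quick consistency check can catch drift before a cook is judged. The sketch below is an assumption-laden illustration, not part of multicooker: it takes the dimension names as a plain list (however they are stored in brief.yaml) and the judge brief as text, and the function name missing_dimensions is hypothetical.

```python
import re


def missing_dimensions(rubric_names, judge_brief_text):
    """Return rubric names that never appear as whole tokens in the judge brief."""
    missing = []
    for name in rubric_names:
        # Match the name as a standalone token, so "polish" does not match "polished".
        pattern = r"(?<![\w-])" + re.escape(name) + r"(?![\w-])"
        if not re.search(pattern, judge_brief_text):
            missing.append(name)
    return missing


# Hypothetical inputs: names as they might be listed in brief.yaml,
# and an excerpt of a JUDGE_BRIEF.md that has drifted out of sync.
rubric_names = ["brand-fit", "accessibility", "density", "motion-restraint"]
judge_brief_text = """
Score each submission 1-10 on:
- brand-fit
- accessibility
- density
"""

print(missing_dimensions(rubric_names, judge_brief_text))  # ['motion-restraint']
```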
$EDITOR cooks/260509-my-task/FEEDBACK.md # general feedback
$EDITOR cooks/260509-my-task/FEEDBACK_claude.md # per-participant (optional)
multicooker refine 260509-my-task # round N+1 on top of previous out/
multicooker judge 260509-my-task
multicooker report 260509-my-task

Previous rounds are preserved in rounds/<N>/, so nothing is lost.
multicooker diff <task> shows what moved at file level between
two rounds — useful for spotting which model actually took the
feedback to heart vs which one just rephrased the previous answer.
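A file-level comparison of two rounds can be approximated by hashing every file and comparing the two sets. The following is only a sketch of that idea, not the implementation behind multicooker diff; the function names and the demo directory layout are assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def snapshot(root):
    """Map each file's path (relative to root) to a SHA-256 of its contents."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in root.rglob("*")
        if path.is_file()
    }


def diff_rounds(old_root, new_root):
    """Classify files as added, removed, or changed between two round directories."""
    old, new = snapshot(old_root), snapshot(new_root)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(p for p in old.keys() & new.keys() if old[p] != new[p]),
    }


# Demo on throwaway directories standing in for two rounds of one participant.
with tempfile.TemporaryDirectory() as tmp:
    r1, r2 = Path(tmp, "1"), Path(tmp, "2")
    r1.mkdir()
    r2.mkdir()
    (r1 / "index.html").write_text("<h1>draft</h1>")
    (r1 / "notes.md").write_text("todo")
    (r2 / "index.html").write_text("<h1>denser layout</h1>")
    (r2 / "style.css").write_text("body { margin: 0 }")
    result = diff_rounds(r1, r2)

print(result)
```

A participant whose round shows no changed files after feedback most likely ignored it; one with many changed files at least touched the output, though the diff alone does not say whether the changes address the feedback.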
multicooker new comparison \
  --participants claude-a=claude,claude-b=claude,codex,gemini

Per-participant model selection lives in brief.yaml:
participants:
- { name: claude-sonnet, flavor: claude, model: claude-sonnet-4-6 }
- { name: claude-opus, flavor: claude, model: claude-opus-4-7 }
- { name: codex }

Useful for, e.g., pitting sonnet against opus on the same task: two horses of the same flavor under different names, with different models.
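The NAME[=FLAVOR] syntax used by --participants can be read as follows. This parser is a hypothetical sketch, not the tool's own code, and it assumes that a bare name such as codex means "flavor equals name", which is what the examples above suggest.

```python
def parse_participants(spec):
    """Parse "name[=flavor],..." into a list of {"name", "flavor"} dicts.

    Assumption: when "=flavor" is omitted, the flavor equals the name.
    """
    participants = []
    seen = set()
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, flavor = item.partition("=")
        name = name.strip()
        flavor = flavor.strip() or name
        if name in seen:
            raise ValueError(f"duplicate participant name: {name!r}")
        seen.add(name)
        participants.append({"name": name, "flavor": flavor})
    return participants


for p in parse_participants("claude-a=claude,claude-b=claude,codex,gemini"):
    print(p)
```

Keeping names unique while letting flavors repeat is what allows two participants of the same flavor to coexist in one cook.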
- One docker compose project per cook (mc-<task>).
- Each participant is in its own container on its own bridge network (net-participant-<name>); they don't see each other via DNS/IP.
- Subscription creds are snapshotted into cooks/<task>/.auth/<flavor>/ (mode 0600, .gitignore'd) and bind-mounted RO only into the corresponding container.
- After the cook, sealed out/ is anonymized into A/B/C/… before judging. The A↔flavor mapping lives on the host only, never goes into judge containers.
- Egress to the internet is open. Sandbox = container, not network.

Threat model: docs/security.md.
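The credential handling described above (snapshot with owner-only permissions, then a read-only bind mount) might look roughly like this. It is a sketch under stated assumptions: the helper names, the fake credentials.json file, and the container path /auth/credentials.json are all hypothetical, and the real logic lives in the project and is described in docs/auth.md.

```python
import os
import shutil
import stat
import tempfile
from pathlib import Path


def snapshot_credential(source, dest_dir):
    """Copy one credential file into dest_dir, readable by the owner only."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    os.chmod(dest_dir, 0o700)  # directories need the execute bit to be entered
    target = dest_dir / Path(source).name
    shutil.copyfile(source, target)
    os.chmod(target, 0o600)  # owner read/write, nothing for group or others
    return target


def readonly_mount(host_path, container_path):
    """Build a docker-style volume string with the read-only flag."""
    return f"{host_path}:{container_path}:ro"


# Demo with a fake credential file in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    fake = Path(tmp, "credentials.json")
    fake.write_text("{}")
    copied = snapshot_credential(fake, Path(tmp, "cooks", "demo", ".auth", "claude"))
    mode = stat.S_IMODE(copied.stat().st_mode)
    mount = readonly_mount(copied, "/auth/credentials.json")

print(oct(mode), mount.endswith(":ro"))
```

Working from a snapshot rather than the live credential file means a container never holds a handle on the host's real login state, and the :ro flag keeps it from modifying even the copy.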
The long version: HOWTO.md. Internals:
docs/orchestration.md,
docs/auth.md,
docs/lifecycle.md.
| Command | What it does |
|---|---|
| multicooker new <task> [--participants ...] | Create a cook from templates. |
| multicooker doctor [<task>] | Preflight: docker, compose, creds, Dockerfiles, base images. |
| multicooker build-base [<flavor>...] | Build the shared base image (auto-built before the first cook). |
| multicooker cook <task> | Launch all participants in parallel. |
| multicooker refine <task> | Round N+1 with feedback on top of previous out. |
| multicooker judge <task> | Anonymized scoring by all judges. |
| multicooker rejudge <task> | Re-run judging (e.g. after editing JUDGE_BRIEF.md). |
| multicooker report <task> | Roll-up into leaderboard.md. |
| multicooker diff <task> | File-level diff between two refine rounds. |
| multicooker add-participant <task> NAME[=FLAVOR] | Add another participant to an existing cook. |
| multicooker clean [<task>] [--all] | compose down -v --rmi local + remove .auth/. |
v0.2. Tested on macOS with OrbStack and Docker Desktop. Linux should work; claude creds come from the Keychain on macOS and from ~/.claude/.credentials.json on Linux.
Bugs → GitHub issues. Security: SECURITY.md.
MIT.