Detailed description of how multicooker works, what it does, why it does it this way, and which rules you can't break. If you just want to run it — see README.md. This file is for understanding the internals and for extending it.
- Why this exists at all
- Mental model
- cook folder structure
- What happens in multicooker cook
- What happens in multicooker judge
- Rules (which are easy to break)
- Docker-mode (the only one)
- Auth and cost
- What to do when something breaks
- Extensions and next steps
- Lessons from reproxy/arena
Sometimes a task is so underspecified that there's no single "correct" solution. You want to see how different LLMs interpret it and what comes out — not so much to "declare a winner" as to get a corpus of 3+ diverging solutions to the same task. This gives you:
- ideas you wouldn't have come up with yourself;
- understanding of where LLMs agree and where they diverge (divergences usually highlight where the task is underspecified);
- an honest, beyond-marketing sanity-check of which model handles your particular kind of task better.
The ancestor is reproxy/arena/ (now on branch archive/arena),
which ran claude/codex/gemini in a three-round tournament over
network scenarios and from which the v0.1.0 release was assembled.
Lessons squeezed out of that experience are described at the end.
                    one task
                        │
        ┌───────────────┼───────────────┐
        ▼               ▼               ▼
    ┌────────┐      ┌────────┐      ┌────────┐
    │ claude │      │ codex  │      │ gemini │  ← parallel, isolated
    │ /work/ │      │ /work/ │      │ /work/ │
    └───┬────┘      └───┬────┘      └───┬────┘
        │   raw/ (read-only, shared)    │
        └───────────────┬───────────────┘
                        ▼
                sealed snapshots
                        │
            ┌───────────┴───────────┐
            ▼                       ▼
        ┌────────┐              ┌────────┐
        │ judge  │              │ judge  │  ← scoring panel
        │ claude │              │ gemini │    (anonymized A/B/C)
        └───┬────┘              └───┬────┘
            │                       │
            └───────────┬───────────┘
                        ▼
                 leaderboard.md
Key properties:
- Parallelism. All participants run at the same time (threads on the host, or containers in docker-mode). Nobody waits for anyone.
- Isolation. Each one sees only its own work/<name>/ plus a shared raw/ (read-only). They don't see each other.
- Judge anonymization. Participants reach the judge as A, B, C, ..., so it doesn't know which model wrote what. The mapping is recovered only in the final report.
- Anti-self-judge. If the judge is the same flavor as one of the participants, multicooker prints a WARN (but still judges). Anonymization already strips some of the bias; for hard isolation add a judge of a third flavor (e.g. another codex-judge) that isn't among the participants.
After multicooker new my-task:
cooks/my-task/
├── BRIEF.md # you write the task for participants here
├── brief.yaml # participants, timeouts, rubric
├── JUDGE_BRIEF.md # judge instructions + rubric
├── raw/ # you drop reference materials here
│ └── .gitkeep
└── work/ # participant work folders (created empty)
├── claude/
├── codex/
└── gemini/
After multicooker cook my-task the following are added:
cooks/my-task/
├── RUN.json # run metadata
├── RUN_RESULT.json # participant statuses
├── work/<p>/BRIEF.md # symlink to ../../BRIEF.md
├── work/<p>/raw/ # symlink to ../../raw
├── work/<p>/out/ # participant writes its result here
├── logs/<p>/<flavor>.stdout.log # raw CLI stdout
└── logs/<p>/<flavor>.stderr.log # raw CLI stderr
After multicooker judge my-task:
cooks/my-task/judging/
├── _inbox/<p>/ # frozen copy of work/<p>/
├── _judge_input/ # anonymized input for the judges
│ └── submissions/{A,B,C}/
├── _logs/<judge-name>/ # judge CLI logs
├── _mapping.json # A→claude, B→codex, ...
├── <judge-name>/scores.json # raw scores (by A/B/C)
├── <judge-name>/scores_deanon.json # with names revealed
└── <judge-name>/review.md # textual justification
After multicooker report my-task:
cooks/my-task/leaderboard.md
Pseudocode:

    for participant in brief.participants:
        setup work/<participant>/   # folder + symlink BRIEF.md + symlink raw/
        spawn thread:
            run host CLI(<flavor>) in work/<participant>/ with prompt = brief
            capture stdout/stderr to logs/<participant>/
            on rate-limit: record evidence, return (don't sleep: others are working)
            on success/timeout: copy work/<participant>/ → judging/_inbox/<participant>/
    join all threads
    write RUN_RESULT.json

Specific technical nuances:
A 2-second pause between launching participants. Otherwise all three CLIs hit auth-refresh at the same time, and the Keychain under load can return an error.
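The staggered launch can be sketched as follows. This is a minimal illustration, not multicooker's actual runner: `run_staggered` and the `run_one` callback are hypothetical names, and the 2-second default simply mirrors the pause described above.

```python
import threading
import time
from typing import Callable, Dict, List


def run_staggered(
    names: List[str],
    run_one: Callable[[str], str],
    stagger_s: float = 2.0,
) -> Dict[str, str]:
    """Start one thread per participant with a pause between launches, then join all."""
    results: Dict[str, str] = {}
    lock = threading.Lock()

    def worker(name: str) -> None:
        try:
            status = run_one(name)
        except Exception as exc:  # a crashed participant must not take down the others
            status = f"error: {exc}"
        with lock:
            results[name] = status

    threads = []
    for i, name in enumerate(names):
        if i > 0:
            time.sleep(stagger_s)  # spread out the auth-refresh calls
        thread = threading.Thread(target=worker, args=(name,), name=f"cook-{name}")
        thread.start()
        threads.append(thread)

    for thread in threads:
        thread.join()
    return results
```

The pause only delays the launches; once started, every participant runs concurrently and nobody waits for a slower neighbour.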
Each CLI has its own "you hit the limit" patterns (see
multicooker/runner_common.py:_RL_PATTERNS). If they appear in the
tail of stdout/stderr, the participant is marked rate_limited
with a pointer to the specific evidence line. We don't block the
others: limits are independent per provider, so if codex dies,
claude and gemini still finish normally.
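A rough sketch of what such a tail scan might look like. The patterns below are illustrative placeholders, not the contents of `_RL_PATTERNS`, and `find_rate_limit_evidence` and the 50-line tail are assumed for the example.

```python
import re
from typing import Optional, Tuple

# Illustrative patterns only; the real list lives in _RL_PATTERNS.
EXAMPLE_RL_PATTERNS = {
    "claude": [re.compile(r"usage limit", re.I)],
    "codex": [re.compile(r"rate limit", re.I), re.compile(r"\b429\b")],
    "gemini": [re.compile(r"quota exceeded", re.I)],
}


def find_rate_limit_evidence(
    flavor: str, log_text: str, tail_lines: int = 50
) -> Optional[Tuple[int, str]]:
    """Return (1-based line number, line) of the first match in the log tail, or None."""
    lines = log_text.splitlines()
    start = max(0, len(lines) - tail_lines)
    for offset, line in enumerate(lines[start:]):
        for pattern in EXAMPLE_RL_PATTERNS.get(flavor, []):
            if pattern.search(line):
                return start + offset + 1, line.strip()
    return None
```

Scanning only the tail keeps an old, already-recovered limit message near the top of a long log from marking a successful run as rate_limited.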
On a Mac, caffeinate -dimsu -w <pid> prevents the system from
sleeping while the CLI is running. But when the lid is closed,
caffeinate doesn't help. So we compare elapsed time.time() (wall)
with elapsed time.monotonic() (which pauses during sleep on macOS);
if the difference is > 60s, we assume the laptop slept and retry
once (API connections almost certainly dropped).
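The wall-vs-monotonic check can be sketched like this. It is a simplified illustration with hypothetical function names; the injectable `wall` and `mono` clocks exist only to make the logic testable, and the 60-second threshold is the one quoted above.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")
SLEEP_SKEW_S = 60.0  # threshold from the text above


def run_with_sleep_check(
    fn: Callable[[], T],
    wall: Callable[[], float] = time.time,
    mono: Callable[[], float] = time.monotonic,
) -> Tuple[T, float]:
    """Run fn and return (result, skew), where skew = wall elapsed - monotonic elapsed."""
    wall0, mono0 = wall(), mono()
    result = fn()
    skew = (wall() - wall0) - (mono() - mono0)
    return result, skew


def run_with_one_retry(
    fn: Callable[[], T],
    wall: Callable[[], float] = time.time,
    mono: Callable[[], float] = time.monotonic,
) -> T:
    """Retry fn exactly once if the clocks suggest the machine slept mid-run."""
    result, skew = run_with_sleep_check(fn, wall, mono)
    if skew > SLEEP_SKEW_S:
        result, _ = run_with_sleep_check(fn, wall, mono)
    return result
```

A large manual or NTP adjustment of the wall clock would produce the same skew, so this is a heuristic, which is one reason to cap it at a single retry.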
One of the arena bugs: the claude CLI has a variadic --add-dir,
which swallows the positional prompt as another path. So the
prompt goes BEFORE --add-dir:
    claude --print "<prompt>" --add-dir /work

and not

    claude --add-dir /work --print "<prompt>"   # ← prompt gets lost

This is baked into
templates/cook/participants/claude/entrypoint.sh. For the
canonical argv order per flavor see docs/orchestration.md.
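If the argv is assembled in Python rather than in a shell script, the same ordering rule might be enforced as below. Both helper names are hypothetical; only the `--print` and `--add-dir` flags come from the text above.

```python
from typing import List


def build_claude_argv(prompt: str, work_dir: str) -> List[str]:
    """Put the prompt right after --print, before the variadic --add-dir."""
    return ["claude", "--print", prompt, "--add-dir", work_dir]


def prompt_precedes_add_dir(argv: List[str], prompt: str) -> bool:
    """Cheap guard for a smoke test: the prompt must come before --add-dir."""
    return argv.index(prompt) < argv.index("--add-dir")
```

Passing the argv as a list (for example to `subprocess.run`) also avoids shell quoting problems with multi-line prompts.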
The participant must put its result under ./out/. This is a
convention spelled out in the template prompt. The judge looks
there too. If a participant ignored it and dumped files at the
root — the judge will see them anyway (it sees the whole worktree
except symlinks).
    participants = brief.participants
    mapping = {A: claude, B: codex, C: gemini}   (random shuffle)
    copy each work/<participant>/ → _judge_input/submissions/<letter>/
    for judge in brief.judges:
        warn if judge.flavor == any participant.flavor   # anti-self-judge (advisory only)
        copy JUDGE_BRIEF.md + raw/ + submissions/ into a fresh _work-<judge>-XXX/
        run host CLI(<judge.flavor>) in that work-dir
        expect ./outbox/scores.json + ./outbox/review.md
        deanonymize scores using mapping
        write deanon to judging/<judge-name>/scores_deanon.json

Arena bug #1: the judge received ./inbox and ./outbox as
symlinks to the real folders. CLI sandboxes (claude --add-dir <work>)
allow reads/writes only inside their own work-dir. A
symlink pointing outside resolves to a path that isn't in the
allowlist, and Read/Bash/Write silently refuse. The result:
97% of scores were placeholders.
The fix: no symlinks. JUDGE_BRIEF.md, raw/, submissions/ are
copied into the judge's work-dir (not symlinked). After the
run the contents of work/outbox/ are copied back into
judging/<judge-name>/.
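A minimal sketch of the copy-only staging, assuming the folder layout shown earlier. The function names and the `parent` argument (where the fresh work-dir is created) are assumptions for the example, not the project's actual API.

```python
import shutil
import tempfile
from pathlib import Path


def stage_judge_workdir(cook_dir: Path, judge_name: str, parent: Path) -> Path:
    """Build a fresh judge work-dir that contains real copies and no symlinks."""
    work = Path(tempfile.mkdtemp(prefix=f"_work-{judge_name}-", dir=str(parent)))
    shutil.copy2(cook_dir / "JUDGE_BRIEF.md", work / "JUDGE_BRIEF.md")
    # symlinks=False (the default) copies link targets, so nothing points outward.
    shutil.copytree(cook_dir / "raw", work / "raw", symlinks=False)
    shutil.copytree(
        cook_dir / "judging" / "_judge_input" / "submissions",
        work / "submissions",
        symlinks=False,
    )
    (work / "outbox").mkdir()
    return work


def collect_outbox(work: Path, cook_dir: Path, judge_name: str) -> None:
    """Copy the judge's outbox back into judging/<judge-name>/ after the run."""
    dest = cook_dir / "judging" / judge_name
    shutil.copytree(work / "outbox", dest, dirs_exist_ok=True)
```

`dirs_exist_ok` requires Python 3.8 or newer. A quick `any(p.is_symlink() for p in work.rglob("*"))` check after staging is a cheap regression guard against the arena bug.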
If the judge sees "submission claude/" — the claude-judge will tend to score "its own" higher (or the opposite, lowballing to compensate). Anonymization plus the anti-self-judge rule remove the crudest sources of bias.
Be aware: bias is not fully removed. claude vs gemini code style is recognizable. If you want more — add a third judge (any anti-bias measure benefits from larger N), and/or ask an agent wrapper to paraphrase outputs before judging (not implemented in v0.1, on the TODO list).
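The anonymization step itself is small. The sketch below assumes scores arrive as a dict keyed by letter; the real scores.json schema is not specified here, and both function names are hypothetical.

```python
import random
import string
from typing import Dict, List, Optional


def make_mapping(
    participants: List[str], rng: Optional[random.Random] = None
) -> Dict[str, str]:
    """Shuffle participants and assign letters A, B, C, ... in the shuffled order."""
    if len(participants) > len(string.ascii_uppercase):
        raise ValueError("this sketch supports at most 26 participants")
    rng = rng or random.Random()
    shuffled = list(participants)
    rng.shuffle(shuffled)
    return dict(zip(string.ascii_uppercase, shuffled))


def deanonymize(scores: Dict[str, dict], mapping: Dict[str, str]) -> Dict[str, dict]:
    """Re-key per-letter scores by participant name (assumed score shape)."""
    return {mapping[letter]: value for letter, value in scores.items()}
```

Generating a fresh mapping for every judging run means a judge cannot learn from an earlier run that "A" was a particular model.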
- Don't let the judge read the participant's stderr.log. In stderr the CLI often puts something like "Claude is thinking...", which is an instant deanon. We put only the participant's work-tree into judging/_inbox/, without logs/.
- JUDGE_BRIEF.md and BRIEF.md must share the same rubric. If you add a dimension to BRIEF.md and forget JUDGE_BRIEF.md, the judge will score something other than what was promised in the brief.
- Don't edit work/<p>/ after cook. If you "help" a participant, the result is no longer its own. If you want to give everyone a hint, update BRIEF.md or raw/ and cook again.
- raw/ is read-only by convention. Technically the filesystem lets the participant write there (we use a symlink). Don't trust it: if the task is sensitive, run diff -r raw/ <expected>/ after cook and confirm the participant didn't change it. Or chmod 555 raw/ before cook.
- API limits are unpredictable. Don't run an overnight cook without RUN_RESULT.json post-processing. In the morning check: were any participants rate_limited? If yes, and they matter to you, plan a re-run (not implemented in v0.1: run manually after quota recovery).
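As an alternative to keeping a second copy of raw/ for diff -r, one could record a digest before cook and compare it afterwards. This is a sketch of that idea, not something multicooker does; `tree_digest` is a hypothetical helper.

```python
import hashlib
from pathlib import Path


def tree_digest(root: Path) -> str:
    """SHA-256 over the relative paths and contents of all files under root."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()
```

Call it on raw/ before cook, store the hex string, and compare after cook; any added, removed, renamed, or modified file changes the digest.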
Starting from v0.2 multicooker only runs in docker-mode. Host-mode
and host_runner.py have been removed — if something broke without
them, fix it in docker-mode.
- Each participant and each judge gets its own container on its own bridge network (net-participant-<name> / net-judge-<name>). No inter-container DNS/IP visibility within a cook.
- Egress to the internet is open. The sandbox is the container, not the network. If a particular cook needs a strict allowlist, drop in a local compose.override.yaml.
- Subscription credentials (Claude Pro / ChatGPT Plus / Gemini Advanced) are snapshotted into cooks/<task>/.auth/<flavor>/ (mode 0600, .gitignore) and bind-mounted RO into the corresponding container. API keys are not needed, and there is no silent fallback to an API key. See docs/auth.md.
- Permission-bypass flags (--dangerously-skip-permissions, --yolo, --dangerously-bypass-approvals-and-sandbox) are mandatory inside the container: without them the CLIs hang on approval prompts. Safe, because the container contains them.
- Shared base images (mc-base-<flavor>:latest) install the heavy stuff (npm i -g <cli>), and the cook Dockerfile is shortened to FROM mc-base-<flavor> + entrypoint. Cook image build is ~1 sec instead of 2-3 min. multicooker build-base builds them manually; cook / refine / judge call base_images.ensure_built() themselves, so it's transparent to the user.
Threat model and what exactly the container protects: see
docs/security.md.
- Subscription-only auth: Claude Pro $20/mo, ChatGPT Plus $20/mo, Gemini Advanced $20/mo. Enough for several tasks a day; limits are low — a typical cook with 3 participants burns ≈ 30k–200k tokens per participant.
- API keys are not used and not wired in as a fallback: if a subscription cred is unavailable, multicooker doctor / cook fail loudly with a remediation message, rather than silently falling back to a paid API.
Rough estimate:
participants × tokens_per_participant × $/token
+ judges × tokens_per_judge × $/token
For a typical "write a 2-page essay" task: ~$0.30–$1.50. For "rewrite this repository": ~$5–$30 (depends on size).
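The rough estimate above is plain arithmetic. The sketch below uses a hypothetical function name, and the per-token price in the usage note is a made-up placeholder, not a quote from any provider.

```python
def estimate_cost_usd(
    participants: int,
    tokens_per_participant: int,
    judges: int,
    tokens_per_judge: int,
    usd_per_token: float,
) -> float:
    """participants x tokens x $/token + judges x tokens x $/token, one blended rate."""
    participant_cost = participants * tokens_per_participant * usd_per_token
    judge_cost = judges * tokens_per_judge * usd_per_token
    return participant_cost + judge_cost
```

For example, 3 participants at 100k tokens each plus 2 judges at 50k tokens each is 400k tokens in total; at a hypothetical blended $3 per million tokens that comes to about $1.20.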
v0.1 has no cost-tracker. If you need one — look at prompt+completion in the subscription CLI logs or in the API ledger. v0.2 wants an automatic ledger (one of the TODOs).
brew install claude-code # or the official anthropic installer
Same for codex and gemini. If you don't need a particular
participant — remove it from brief.yaml before cook.
Look at cooks/<name>/judging/_logs/<judge>/<flavor>.stdout.log.
Most common cases:
- the judge hit its own rate-limit;
- the judge considered the task too ambiguous and asked for clarification (visible in its output);
- the judge tripped over the symlink bug (shouldn't happen with this version of the judge — we copy, we don't symlink).
Most often the rubric is unclear to the judge. Re-read your
JUDGE_BRIEF.md with the eyes of a disinterested person. If a
dimension says "quality" without a definition — the judge scores
at random. The more concrete the phrasing ("did the answer
reference all 3 source documents?"), the more stable the scores.
Each CLI is multi-threaded on its own. Three parallel claudes can saturate a laptop. Lower parallelism:
    participants:
      - name: claude
        flavor: claude
      # codex and gemini commented out; run in two passes

v0.2 wants a --max-parallel N flag.
Not every task gets solved in one round. multicooker refine <task> runs another round on top of the previous output:
- Each participant sees its previous ./out/ in place, RW, and edits/replaces/extends it.
- Before the run, the previous round is snapshotted into rounds/<N>/<p>/ (immutable history), plus the sealed judging/_inbox/ is copied into rounds/<N>/_inbox/.
- Inlined into PROMPT.txt are:
  - shared feedback from cooks/<task>/FEEDBACK.md (a common review for everyone);
  - personal feedback from cooks/<task>/FEEDBACK_<flavor>.md (optional, addressed to a specific participant).

--participants <list> lets you refine a subset. --feedback <path> swaps the source of shared feedback for an arbitrary file, which is handy when one piece of feedback applies to several cooks. multicooker diff <task> N M shows a unified diff between rounds per participant, a sanity-check that refine actually changed something.
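To illustrate what a per-participant round comparison involves, here is a sketch built on the standard library's difflib. It is not the implementation of multicooker diff; `diff_rounds` is a hypothetical helper that takes two snapshot directories such as rounds/<N>/<p>/ and rounds/<M>/<p>/.

```python
import difflib
from pathlib import Path
from typing import List


def diff_rounds(round_a: Path, round_b: Path) -> List[str]:
    """Unified diff of every file present in either round directory."""
    rel_paths = sorted(
        {p.relative_to(round_a).as_posix() for p in round_a.rglob("*") if p.is_file()}
        | {p.relative_to(round_b).as_posix() for p in round_b.rglob("*") if p.is_file()}
    )
    out: List[str] = []
    for rel in rel_paths:
        a, b = round_a / rel, round_b / rel
        # A file missing on one side is treated as empty, so it shows as added or removed.
        a_lines = a.read_text(errors="replace").splitlines(keepends=True) if a.is_file() else []
        b_lines = b.read_text(errors="replace").splitlines(keepends=True) if b.is_file() else []
        out.extend(
            difflib.unified_diff(a_lines, b_lines, fromfile=f"a/{rel}", tofile=f"b/{rel}")
        )
    return out
```

An empty result means the refine round changed nothing for that participant, which is usually worth investigating before judging again.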
Round artifacts: REFINE_<N>.json (start metadata),
REFINE_<N>_RESULT.json (status + duration + rate-limit info per
participant). The full artifact lifecycle is in
docs/lifecycle.md.
After refine, the same judging step is expected:
multicooker judge <task> → multicooker report <task>.
A separate command: re-judge the same snapshot without a
re-cook. Useful when you've edited JUDGE_BRIEF.md (rubric,
weights) or manually patched out/<p>/RESULT.md. It does three
things:
- Re-seals judging/_inbox/<p>/ from the current work/<p>/out/ (important: a regular judge uses the already-sealed inbox and will miss edits to out/).
- Cleans previous judges' outboxes in judging/<judge>/.
- Calls the regular judge flow (fresh anonymization: _mapping.json is always regenerated, so the anti-bias guarantee is not weakened).
Parameters: --judges (same as judge).
Each participant run also writes work/<p>/trace.json with
{prompt, model, exit_code, duration_s, started_at, status} — a
cheap structured artifact for debugging and for future replay
scenarios. A full structured-trace version (tool calls) is
deferred — see docs/design-notes.md.
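A trace record with the fields listed above could be written like this. The field types (for instance an ISO-8601 string for started_at) and the names `Trace` and `write_trace` are assumptions for the example.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class Trace:
    prompt: str
    model: str
    exit_code: int
    duration_s: float
    started_at: str  # assumed to be an ISO-8601 timestamp
    status: str


def write_trace(work_dir: Path, trace: Trace) -> Path:
    """Serialize the trace as work_dir/trace.json and return the path."""
    path = work_dir / "trace.json"
    path.write_text(json.dumps(asdict(trace), indent=2) + "\n", encoding="utf-8")
    return path
```

Keeping the record flat and JSON-serializable makes it easy to grep across cooks or load into a table later.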
What's left on the TODO list (see docs/todo.md for the current
list):
- Cost ledger: on every run, parse usage from the CLI and write cook/cost_ledger.json.
- Resume: multicooker resume <name> re-runs only rate_limited or error participants, leaving ok alone.
- Per-participant timeout (currently a global timeout_s).
- multicooker diff <task> N M: round comparison.
- Replayable traces / registry: structured run trace, versioned task specs (ideas from agentevals / OpenAI Evals).
- Web report: multicooker serve <name> shows HTML with diffs between submissions, judging logs, and the leaderboard.
- Cross-cook leaderboard: global table "claude wins 7 out of 10 tasks, codex 2, gemini 1".
What overnight runs taught us not to do:
- Variadic CLI flags ALWAYS swallow positional args. With claude, an --add-dir <wt> placed before the prompt swallows the prompt as another path and leaves the CLI waiting on stdin → 0-byte diff → "0/100 on correctness". Fix: prompt BEFORE variadic flags.
- Symlinks inside the sandbox allowlist don't work. The CLI sees a path that resolves outward and silently refuses; no errors, just an empty outbox. Fix: never symlink into a work-dir we hand to the CLI with --add-dir. Only copy.
- Codex quota overruns. OpenAI ChatGPT Plus quota ran out every ~5 hours mid-round → one of the three "zeroed out". Fix: accept it (can't be worked around) and in the orchestrator do a per-participant deferred-retry so the others aren't blocked.
- Don't trust the exit code. Many CLIs return 0 even when they hit a limit, because they "successfully reported the limit". Fix: always parse stderr for known-bad patterns.
- Don't write markdown handovers for the CLI expecting it to act on them. It will read the file, but it won't change its behavior. If you want the participant to change behavior, put it in the prompt, not in a file.
- Mid-run sleep on a Mac. A closed lid drops connections to the Anthropic API, and caffeinate doesn't always help. Fix: retroactive detection via wall-vs-monotonic skew + one retry.
- Don't trust the leaderboard from the first run. Reproxy-arena overnight #1 showed gemini > codex > claude. After fixing the argv bug and the judge symlinks, the order changed. Only after a smoke test and a second run were the numbers meaningful.
- Artifacts eat disk fast. Reproxy-arena: 4.3 GB over two overnights. In multicooker the artifact = only cook/<name>/, no round snapshots; the cap is low, but the habit of cleaning up old cooks is useful.