multicooker

Run several LLM agents (claude, codex, gemini) on the same task in parallel — each in its own docker container with its own subscription auth — then have other LLM agents read the outputs blind (under A / B / C labels), score them against your rubric, and write reviews.

You get a leaderboard.md plus a corpus of N divergent solutions to one brief. No API bills: it goes through your Claude Pro / ChatGPT Plus / Gemini Advanced subscriptions.

«multicooker»: one task, several dishes cook in parallel in their own pots; you compare what came out of each.

🇷🇺 Russian version: README.ru.md.

Why

When a task is underspecified (design, copy, refactoring with an architectural choice, code review), there is no single "correct" answer. Each model fills the gaps in the brief on its own, and what it fills in is the interesting part. A single run through a single model hides this: you see only one interpretation and assume it's "the answer".

multicooker gives you a corpus of divergent interpretations of the same brief in one shot. Useful when:

  • You're picking between models for a recurring task (refactoring, design, doc writing, code review) and are tired of deciding by vibes.
  • You want to see where a brief is underspecified — disagreement between models highlights exactly those spots.
  • You're doing design or copy work and want three takes from three different "heads" instead of one.
  • You're studying how much models agree with each other on open tasks (often: not much).

How it works (one cook end-to-end)

                     ┌─────────────────────────────┐
                     │      cooks/260516-task/     │
                     │  BRIEF.md  JUDGE_BRIEF.md   │
                     │  brief.yaml      raw/       │
                     └──────────────┬──────────────┘
                                    │ multicooker cook
        ┌───────────────────────────┼───────────────────────────┐
        ▼                           ▼                           ▼
 ┌─────────────┐             ┌─────────────┐             ┌─────────────┐
 │  claude     │             │  codex      │             │  gemini     │
 │  container  │   (parallel)│  container  │  (parallel) │  container  │
 │  net-A      │             │  net-B      │             │  net-C      │
 │  /work/...  │             │  /work/...  │             │  /work/...  │
 └──────┬──────┘             └──────┬──────┘             └──────┬──────┘
        │ out/                      │ out/                      │ out/
        └───────────────────────────┼───────────────────────────┘
                                    ▼
                       ┌─────────────────────────┐
                       │   anonymize → A/B/C     │
                       │   mapping stays on host │
                       └──────────────┬──────────┘
                                      │ multicooker judge
                ┌─────────────────────┼─────────────────────┐
                ▼                                           ▼
       ┌─────────────────┐                         ┌─────────────────┐
       │  judge-1        │                         │  judge-2        │
       │  (claude/codex/ │  scores everyone except │  (different     │
       │   gemini)       │  its own flavor         │   flavor)       │
       └────────┬────────┘                         └────────┬────────┘
                │ scores.json + review.md                   │
                └─────────────────────┬─────────────────────┘
                                      ▼ multicooker report
                            ┌──────────────────┐
                            │ leaderboard.md   │
                            └──────────────────┘
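
As a rough illustration of the last step in the diagram, per-judge scores can be rolled up into a ranking. This is a sketch under stated assumptions, not multicooker's actual report code: the score layout (judge → label → dimension → number) and the function name rollup are hypothetical, and the numbers are made up.

```python
from statistics import mean

def rollup(scores):
    """Average every score each label received, across judges and dimensions.

    `scores` is a hypothetical structure: {judge: {label: {dimension: number}}}.
    Returns (label, average) pairs, best first.
    """
    per_label = {}
    for by_label in scores.values():
        for label, dims in by_label.items():
            per_label.setdefault(label, []).extend(dims.values())
    ranked = [(label, mean(vals)) for label, vals in per_label.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

# Made-up numbers, only to show the shape of the data.
example = {
    "judge-1": {"A": {"polish": 4, "typography": 3}, "B": {"polish": 2, "typography": 2}},
    "judge-2": {"A": {"polish": 5, "typography": 4}, "B": {"polish": 3, "typography": 1}},
}
print(rollup(example))
```

A plain average treats every dimension and every judge equally; a real rubric may well weight them differently.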

The key properties:

  • Isolation. Each participant runs in its own container on its own bridge network — can't see the other participants, the judge brief, or the A↔flavor mapping.
  • Parallelism. All participants start at the same time. One being rate-limited doesn't block the others.
  • Anonymization. Judges only see A / B / C with no model names. The mapping lives only on the host.
  • Anti-self-judge. A judge never scores submissions from its own flavor — claude doesn't judge claude's output.
  • No API keys. Subscription credentials (Claude Pro / ChatGPT Plus / Gemini Advanced) are passed into containers via bind-mount or named volume, read-only. See docs/auth.md.
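
The anonymization and anti-self-judge properties can be sketched in a few lines. The function names, the dict-based mapping, and the shuffle-then-label approach are assumptions for illustration; the source does not say how multicooker assigns labels internally.

```python
import random
import string

def anonymize(participants, seed=None):
    """Shuffle participant names and hand out labels A, B, C, ...

    Returns {label: participant_name}. In this sketch the mapping is just a
    dict kept by the caller; judges would only ever be given the labels.
    Supports at most 26 participants (one per uppercase letter).
    """
    names = list(participants)
    random.Random(seed).shuffle(names)
    return dict(zip(string.ascii_uppercase, names))

def labels_for_judge(judge_flavor, mapping, flavor_of):
    """Labels a judge may score: everything except its own flavor."""
    return sorted(
        label for label, name in mapping.items()
        if flavor_of[name] != judge_flavor
    )

# Hypothetical participants, each named after its flavor.
flavor_of = {"claude": "claude", "codex": "codex", "gemini": "gemini"}
mapping = anonymize(flavor_of, seed=0)
print(labels_for_judge("claude", mapping, flavor_of))  # two of A/B/C
```

Shuffling before labeling matters: if labels followed the order in the config, a judge could guess which model sits behind A.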

Install

git clone https://github.com/faeton/multicooker
cd multicooker
pip install -e .

Requirements:

  • macOS or Linux host with a running docker daemon. On macOS, OrbStack is the recommended runtime — noticeably faster startup, lower idle CPU, and friendlier resource handling than Docker Desktop. Docker Desktop and colima also work.
  • Python 3.10+.
  • At least one of these CLIs installed and logged in: claude (run claude /login), codex (run codex to log in), gemini (run gemini to log in). You only need the flavors you actually want to run.

Want to try the pipeline without subscription creds? There's a dummy flavor — see examples/hello-task.

Quick start: let an agent scaffold the cook (10 seconds)

The fastest way to use multicooker is to fire up an LLM agent inside the repo and let it scaffold and run the cook for you. The repo ships with a CLAUDE.md (and an AGENTS.md symlink for codex / gemini) that already explains the project, the shape of a cook, and the rule that the rubric stays in sync between brief.yaml and JUDGE_BRIEF.md. Any agent reading it can do the boring part for you.

git clone https://github.com/faeton/multicooker && cd multicooker
pip install -e .

claude        # or: codex, or: gemini — they all read AGENTS.md

Then describe what you want in plain language:

"Set up a cook called landing-redesign. Compare claude / codex / gemini on a single-file HTML hero for [product]. Judge on visual-hierarchy, typography, color-discipline, content-fit, polish. References are at ~/work/brand/notes.md and ~/work/brand/voice.md. Then run cook + judge + report."

The agent reads CLAUDE.md and examples/design-landing/ as templates, drafts your BRIEF.md / JUDGE_BRIEF.md / brief.yaml, copies the refs into raw/, kicks off multicooker cook, waits for it to finish, then runs judge and report. You read the leaderboard.

Iterating is the same conversation:

"Feedback for everyone: too much whitespace, push for denser layout. Specifically for claude: keep the color palette but tighten the type scale. Refine."

Or — start a new cook reusing the same reference material (different task, same brand assets):

"Same refs as the previous cook. New brief: a 3-frame onboarding sequence instead of a single landing. Judge the same dimensions plus story-clarity. Run it."

This is the canonical workflow. The manual flow below is useful for understanding the moving parts, but it's not how you'd typically use the tool day-to-day.

Manual flow (5 minutes, full control)

# 1. Preflight — docker, compose, creds for each flavor
multicooker doctor

# 2. Scaffold (name is auto-prefixed with today's date → 260509-my-task)
multicooker new my-task

# 3. Describe the task
cd cooks/260509-my-task
$EDITOR BRIEF.md          # what participants must do
$EDITOR JUDGE_BRIEF.md    # how judges will score
$EDITOR brief.yaml        # participants, judges, timeout, rubric
cp ~/some-reference.* raw/   # reference materials (mounted RO)

# 4. Cook — all participants in parallel, each in its own container
multicooker cook 260509-my-task

# 5. Judge — blind: judges only see A/B/C labels
multicooker judge 260509-my-task

# 6. Summary → leaderboard.md
multicooker report 260509-my-task
cat cooks/260509-my-task/leaderboard.md
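
The date prefix in step 2 looks like a two-digit year, month, and day (260509 for 9 May 2026). Assuming that reading is right, the naming rule could be sketched as below; prefixed_name is a hypothetical helper, not a function from the tool.

```python
from datetime import date

def prefixed_name(name, today=None):
    """Prefix a cook name with a YYMMDD date (assumed format)."""
    today = today or date.today()
    return f"{today:%y%m%d}-{name}"

print(prefixed_name("my-task", date(2026, 5, 9)))  # 260509-my-task
```

A side effect of this format is that cook directories sort chronologically in a plain ls.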

Examples

Two ready-to-run examples in the repo — copy and go:

  • examples/hello-task — sanitized smoke test on the dummy flavor, no LLM creds required. ~10 seconds from start to leaderboard. Run it once to see the shape of a cook on the simplest possible task.

  • examples/design-landing — a real design task: each model designs its own landing page for multicooker. Three HTML files you then compare side-by-side in a browser. More on this below.

Use case: design and creative tasks

The most illustrative use case is tasks where there's no right answer but there are quality criteria. Design, copy, naming, architectural essays. Here models diverge not because one is buggy but because they hold different "aesthetic beliefs", and comparison becomes substantive.

examples/design-landing is a working template for this kind of cook. Brief: "design a landing page for multicooker, single-file HTML, no build step". When you open the three index.html files side by side, you typically see:

  • Palette. One model commits to strict monochrome; another scatters six accent colors and doesn't quite know what to do with them; another defaults to dark mode.
  • Typography. Someone reaches for the system stack; someone pulls Inter from Google Fonts; someone leaves the default serif — and the hero blocks read completely differently as a result.
  • Density. One packs features into a three-column grid with small text; another goes for one big half-screen block.
  • Content fit. Someone quotes raw/product.md verbatim; someone reimagines the product according to their own theories of what a "proper landing" should be (the content-fit dimension in the rubric exists to catch this).
  • Polish. Hover states, spacing rhythm, code-block styling, footer treatment — small decisions that separate "draft" from "shipped".

The rubric in examples/design-landing/JUDGE_BRIEF.md scores on visual-hierarchy / typography / color-discipline / content-fit / polish. Two judges of different flavors score blindly — and they often disagree with each other. That's a useful signal: on design tasks, judge disagreement means there's no "winner on points", just three different directions, and you pick with your eyes.
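
One way to make judge disagreement visible is to compare two judges' scores for the same submission, dimension by dimension. The sketch below assumes each judge's scores can be read as a {dimension: number} dict; that shape and the disagreement helper are hypothetical, and the numbers are invented.

```python
def disagreement(judge_1, judge_2):
    """Absolute per-dimension gap between two judges' scores for one label.

    Both arguments are hypothetical {dimension: number} dicts. Only
    dimensions scored by both judges are compared.
    """
    shared = judge_1.keys() & judge_2.keys()
    return {dim: abs(judge_1[dim] - judge_2[dim]) for dim in sorted(shared)}

# Made-up scores for submission "A" from two judges.
gaps = disagreement(
    {"typography": 4, "polish": 2, "content-fit": 5},
    {"typography": 4, "polish": 5, "content-fit": 3},
)
print(gaps)  # {'content-fit': 2, 'polish': 3, 'typography': 0}
```

Large gaps point at the dimensions worth checking with your own eyes before trusting the total.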

# Run the design example (requires claude/codex/gemini logins)
multicooker new landing --participants claude,codex,gemini
TASK=$(basename "$(ls -d cooks/*-landing | tail -1)")
cp examples/design-landing/{BRIEF.md,JUDGE_BRIEF.md,brief.yaml} cooks/$TASK/
cp examples/design-landing/raw/* cooks/$TASK/raw/

multicooker cook   $TASK
multicooker judge  $TASK
multicooker report $TASK

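# Open all three variants side by side, plus the leaderboard
# (open is the macOS launcher; on Linux, open the files in your browser or run xdg-open once per file)
open cooks/$TASK/out/*/index.html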
cat  cooks/$TASK/leaderboard.md

This template adapts to any design task — SVG logo, README header, email template, dashboard mockup. You only need to rewrite BRIEF.md for your output and tweak the rubric dimensions (brand-fit, accessibility, density, motion-restraint — anything, as long as the names match between brief.yaml and JUDGE_BRIEF.md). See examples/design-landing/README.md for the full adaptation guide.
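
Because the dimension names have to match in both files, a quick consistency check can catch a renamed or forgotten dimension before a cook is run. This is a crude sketch: it assumes the dimension names have already been read out of brief.yaml into a list, and it only checks that each name appears somewhere in the judge brief text. missing_dimensions is a hypothetical helper.

```python
def missing_dimensions(dimensions, judge_brief_text):
    """Return rubric dimension names that never appear in the judge brief."""
    text = judge_brief_text.lower()
    return [dim for dim in dimensions if dim.lower() not in text]

# Example values, not taken from a real cook.
dimensions = ["brand-fit", "accessibility", "density"]
judge_brief = "Score each submission on brand-fit and accessibility."
print(missing_dimensions(dimensions, judge_brief))  # ['density']
```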

Iterating on a result

$EDITOR cooks/260509-my-task/FEEDBACK.md          # general feedback
$EDITOR cooks/260509-my-task/FEEDBACK_claude.md   # per-participant (optional)

multicooker refine 260509-my-task    # round N+1 on top of previous out/
multicooker judge  260509-my-task
multicooker report 260509-my-task

Previous rounds are preserved in rounds/<N>/ — nothing is lost. multicooker diff <task> shows what moved at file level between two rounds — useful for spotting which model actually took the feedback to heart vs which one just rephrased the previous answer.
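
To give a feel for what a file-level comparison between rounds involves, here is a minimal sketch that hashes every file in two directory trees and classifies paths as added, removed, or changed. It is not the implementation behind multicooker diff; the function names are hypothetical, and the directories passed in are whatever two round snapshots you want to compare.

```python
import hashlib
from pathlib import Path

def file_digests(root):
    """Map each file's path (relative to root) to a SHA-256 of its bytes."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in root.rglob("*") if p.is_file()
    }

def round_diff(old_root, new_root):
    """Classify files as added, removed, or changed between two trees."""
    old, new = file_digests(old_root), file_digests(new_root)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

A participant whose round shows nothing under "changed" after feedback most likely did not act on it.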

Multiple participants of the same flavor / different models

multicooker new comparison \
  --participants claude-a=claude,claude-b=claude,codex,gemini

Per-participant model selection lives in brief.yaml:

participants:
  - { name: claude-sonnet, flavor: claude, model: claude-sonnet-4-6 }
  - { name: claude-opus,   flavor: claude, model: claude-opus-4-7 }
  - { name: codex }

Useful for, e.g., pitting sonnet against opus on the same task — two horses of the same flavor under different names, with different models.
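
The --participants value follows a NAME[=FLAVOR] pattern. A minimal parser for that syntax might look like the sketch below. It assumes that a bare name such as codex doubles as its own flavor, which is inferred from the examples rather than confirmed; parse_participants is a hypothetical helper.

```python
def parse_participants(spec):
    """Parse 'NAME[=FLAVOR],...' into (name, flavor) pairs.

    Assumption: when '=FLAVOR' is omitted, the name doubles as the flavor.
    """
    pairs = []
    for item in spec.split(","):
        name, _, flavor = item.strip().partition("=")
        pairs.append((name, flavor or name))
    return pairs

print(parse_participants("claude-a=claude,claude-b=claude,codex,gemini"))
# [('claude-a', 'claude'), ('claude-b', 'claude'), ('codex', 'codex'), ('gemini', 'gemini')]
```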

Isolation and security (short version)

  • One docker compose project per cook (mc-<task>).
  • Each participant is in its own container on its own bridge network (net-participant-<name>); they don't see each other via DNS/IP.
  • Subscription creds are snapshotted into cooks/<task>/.auth/<flavor>/ (mode 0600, .gitignore'd) and bind-mounted RO only into the corresponding container.
  • After the cook, the sealed out/ is anonymized into A/B/C/… before judging. The A↔flavor mapping lives on the host only and never goes into judge containers.
  • Egress to the internet is open. Sandbox = container, not network. Threat model: docs/security.md.
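
The layout described in this list can be pictured as a compose-style structure with one service and one bridge network per participant. The sketch below builds such a structure as a plain dict. It is an illustration only: the container-side path /auth is a placeholder, required fields such as image are left out, and compose_layout is a hypothetical helper rather than anything the tool exposes.

```python
def compose_layout(task, participants):
    """Build a compose-style dict: one service and one bridge network each.

    `participants` maps participant name -> flavor. The container-side
    mount path below is a placeholder, not the tool's real path.
    """
    services, networks = {}, {}
    for name, flavor in participants.items():
        net = f"net-participant-{name}"
        networks[net] = {"driver": "bridge"}
        services[name] = {
            "networks": [net],
            # Short volume syntax: host path, container path, read-only flag.
            "volumes": [f"./.auth/{flavor}:/auth:ro"],
        }
    return {"name": f"mc-{task}", "services": services, "networks": networks}

layout = compose_layout("260509-my-task", {"claude": "claude", "codex": "codex"})
print(sorted(layout["networks"]))
# ['net-participant-claude', 'net-participant-codex']
```

Since no two services share a network, one participant's container has no route to another's by name or address, which is the property the list describes.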

The long version: HOWTO.md. Internals: docs/orchestration.md, docs/auth.md, docs/lifecycle.md.

Commands

Command                                           What it does
------------------------------------------------  ----------------------------------------------------------------
multicooker new <task> [--participants ...]       Create a cook from templates.
multicooker doctor [<task>]                       Preflight: docker, compose, creds, Dockerfiles, base images.
multicooker build-base [<flavor>...]              Build the shared base image (auto-built before the first cook).
multicooker cook <task>                           Launch all participants in parallel.
multicooker refine <task>                         Round N+1 with feedback on top of previous out.
multicooker judge <task>                          Anonymized scoring by all judges.
multicooker rejudge <task>                        Re-run judging (e.g. after editing JUDGE_BRIEF.md).
multicooker report <task>                         Roll-up into leaderboard.md.
multicooker diff <task>                           File-level diff between two refine rounds.
multicooker add-participant <task> NAME[=FLAVOR]  Add another participant to an existing cook.
multicooker clean [<task>] [--all]                compose down -v --rmi local + remove .auth/.
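
For scripting, the cook, judge, and report commands can be chained from Python. The sketch below uses only the three subcommands listed in this README. It assumes that multicooker is on PATH and that each subcommand exits non-zero on failure, which the source does not state; both function names are hypothetical.

```python
import subprocess

def pipeline_commands(task):
    """Argv lists for the cook -> judge -> report sequence."""
    return [["multicooker", step, task] for step in ("cook", "judge", "report")]

def run_pipeline(task):
    """Run each step in order; check=True raises on the first failing step."""
    for argv in pipeline_commands(task):
        subprocess.run(argv, check=True)

print(pipeline_commands("260509-my-task")[0])
# ['multicooker', 'cook', '260509-my-task']
```

Passing argv lists instead of a shell string avoids quoting problems if a task name ever contains unusual characters.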

Status

v0.2. Tested on macOS with OrbStack and Docker Desktop. Linux should work. claude credentials are read from Keychain on macOS (darwin) and from ~/.claude/.credentials.json on Linux.

Bugs → GitHub issues. Security: SECURITY.md.

License

MIT.
