N-agent eval framework: cross-model > multi-agent, 5 agents counterproductive#4

Open
Restuta wants to merge 5 commits into main from evals-and-pr-discuss

Conversation


@Restuta Restuta commented Mar 19, 2026

Summary

  • Eval framework for objectively measuring discussion quality: expert checklists (13-15 items per topic), trap detection, and blind pairwise LLM judging
  • 15 baseline discussions across 3 topics (fintech payments, monorepo migration, healthcare AI) x 5 configs (1/2/3/5 codex agents + claude+codex cross-model)
  • Evidence-backed changes: reduce default max_rounds 7→5, recommend cross-model in README, remove 3+ agents from roadmap
  • Includes PR discussion mode (--pr flag) from earlier commits on this branch
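The checklist-coverage numbers reported below can be produced by something as simple as keyword matching against the transcript. A minimal sketch of that idea (the item ids, patterns, and function name here are invented for illustration, not the actual `tests/n-agent-eval.js` scoring logic):

```javascript
// Illustrative checklist scorer: coverage = fraction of expert checklist
// items whose keyword patterns appear in the discussion transcript.
function scoreChecklist(transcript, checklist) {
  const text = transcript.toLowerCase();
  const hits = checklist.filter((item) =>
    item.patterns.some((p) => text.includes(p.toLowerCase()))
  );
  return {
    covered: hits.length,
    total: checklist.length,
    coverage: hits.length / checklist.length,
  };
}

// Invented fintech-payments items for demonstration:
const checklist = [
  { id: "idempotency", patterns: ["idempotency key", "duplicate charge"] },
  { id: "pci-scope", patterns: ["pci", "tokenization"] },
];
const result = scoreChecklist(
  "Use an idempotency key to avoid duplicate charges.",
  checklist
);
// result.coverage === 0.5 (1 of 2 items matched)
```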

Key Findings

| Config | Avg Checklist Coverage | Avg Traps Caught | Avg Duration | Avg Tokens |
|---|---|---|---|---|
| 1-codex (solo) | 90% | 57% | 314s | ~2K |
| 2-codex | 95% | 61% | 532s | ~6K |
| 3-codex (+synthesizer) | 97% | 61% | 500s | ~7K |
| 5-codex (full panel) | 92% | 78% | 1014s | ~13K |
| 2-cross (Claude+Codex) | 97% | 89% | 607s | ~7K |
  1. Cross-model wins: Claude+Codex (97% coverage, 89% traps) ties the best same-model config on coverage and beats every one on trap detection
  2. 5 agents is counterproductive: worse coverage (92%) at 2x cost — agents go deep on their role and lose breadth
  3. Debate helps vs solo: 90% → 95-97%, especially on cross-domain topics (healthcare)
  4. All discussions converge at round 3: max_rounds=7 was wasteful, reduced to 5
  5. LLM-as-judge has verbosity bias: ranked 5-codex 12-0 despite worse checklist scores — longer outputs get higher scores regardless of quality
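Finding 5 is worth pausing on: blind pairwise judging only hides which config produced a transcript, not how long it is. A sketch of the blind-judging mechanics (`judgeFn` stands in for the actual LLM call; none of these names come from the repo):

```javascript
// Blind pairwise comparison: randomize presentation order so the judge
// cannot tell which config produced which transcript, then map the
// verdict ("A" = first-shown transcript wins) back to the original pair.
function blindPairwise(a, b, judgeFn, rng = Math.random) {
  const swapped = rng() < 0.5;
  const [first, second] = swapped ? [b, a] : [a, b];
  const verdict = judgeFn(first.text, second.text); // "A" or "B"
  const firstWon = verdict === "A";
  const winner = firstWon !== swapped ? a : b;
  return winner.config;
}

const short = { config: "2-cross", text: "short, precise analysis" };
const long = { config: "5-codex", text: "much longer output..." };
// A judge that always prefers whichever transcript is shown first:
blindPairwise(short, long, () => "A", () => 0.9); // → "2-cross" (no swap)
blindPairwise(short, long, () => "A", () => 0.1); // → "5-codex" (swapped)
```

Order randomization controls position bias but does nothing for length bias, which is how a 12-0 pairwise result for 5-codex can coexist with worse checklist scores.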

What we tried and reverted

Synthesizer consensus template, longer research prompts, new contention table headers — before/after eval showed no measurable improvement (within run-to-run noise). Reverted to keep only evidence-backed changes.

Test plan

  • Dry run: `node tests/n-agent-eval.js --dry-run` shows the correct 15-run matrix
  • Smoke test: single topic × config completes successfully
  • Full matrix: 15 runs completed, scored, and reported
  • After-changes eval: confirmed template reverts were correct
  • Existing `headless-council.js` remains backward compatible (only the max_rounds default changed)

🤖 Generated with Claude Code

Restuta and others added 4 commits March 19, 2026 20:32
Automated evaluation framework for discuss-skill:
- tests/eval.js: runner that creates discussion files, runs the
  orchestrator, and validates output structure/consensus/frontmatter
- tests/cases/*.json: test case definitions with custom assertions
- Validates: frontmatter state, required sections, consensus format,
  lens application (research only, not turns)
- Reports pass/fail with round count and duration

Two test cases:
- basic-council: default lens produces valid consensus
- custom-lens: simplicity-vs-correctness lens applies correctly

Both pass (3 rounds each, ~4 min total).
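The structural validation described above might look roughly like this; the frontmatter field and section names are assumptions for illustration, not the shipped tests/eval.js:

```javascript
// Validate the structure of a concluded discussion file: frontmatter
// state plus required sections. Returns a list of errors (empty = valid).
function validateDiscussion(markdown) {
  const errors = [];
  if (!/^---\n[\s\S]*?state: concluded[\s\S]*?\n---/.test(markdown)) {
    errors.push("frontmatter missing state: concluded");
  }
  for (const section of ["## Consensus", "## Points of Contention"]) {
    if (!markdown.includes(section)) errors.push(`missing ${section}`);
  }
  return errors;
}

validateDiscussion(
  "---\nstate: concluded\n---\n## Consensus\n...\n## Points of Contention\n..."
); // → []
```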

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/discuss --pr 123 debates the design decisions in a pull request:
- Pulls PR context via gh CLI (title, body, diff)
- Generates discussion topic from PR, focused on design tradeoffs
- Default lens: simplicity-vs-correctness
- Posts consensus as PR comment when done
- Explicitly scoped to architecture, not code style

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…uality

Eval framework:
- N-agent orchestrator (headless-council-n.js) supporting 2-5 agents with roles
- 3 eval topics with expert checklists (13-15 items each) and trap detection
- Comparison runner (n-agent-eval.js) with checklist scoring + blind pairwise judge
- Ran 15 discussions: 5 configs x 3 topics (1/2/3/5-codex + claude+codex)

Key findings:
- Cross-model (Claude+Codex) wins: 97% coverage, 89% traps caught
- 3-agent with synthesizer matches cross-model coverage (97%)
- 5-agent is counterproductive: 92% coverage at 2x cost
- All discussions converge at round 3 regardless of agent count
- Research phase is highest leverage (solo agent scores 90%)

Changes based on evidence:
- Swap consensus to synthesizer template (neutral arbiter framing)
- Increase research target from ~200 to ~500 words
- Update contention table to "Who Had the Strongest Case & Why"
- Reduce default max_rounds from 7 to 5
- Recommend cross-model as default in README
- Remove 3+ participant panels from roadmap (data says counterproductive)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Before/after comparison (N=1 per cell) showed the synthesizer consensus
template, longer research target, and new contention table header produced
no measurable improvement (2-codex: 95%→93%, 2-cross: 97%→95%). Changes
are within run-to-run noise but trending negative, so reverting to keep
only evidence-backed changes.

Retained: max_rounds 7→5 (all 15 runs converged at round 3), roadmap
update removing 3+ agents (data shows counterproductive), cross-model
recommendation in README (data shows consistent advantage).

After-changes eval results added for comparison.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Restuta Restuta changed the title Eval framework + PR discussion mode N-agent eval framework: cross-model > multi-agent, 5 agents counterproductive Apr 1, 2026
Links the article that inspired the N-agent eval experiment, with our
own findings noted (3-agent matches cross-model, 5-agent counterproductive).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>