N-agent eval framework: cross-model > multi-agent, 5 agents counterproductive#4

Open
Restuta wants to merge 5 commits into main from evals-and-pr-discuss

Conversation


@Restuta Restuta commented Mar 19, 2026

Summary

  • Eval framework for objectively measuring discussion quality: expert checklists (13-15 items per topic), trap detection, and blind pairwise LLM judging
  • 15 baseline discussions across 3 topics (fintech payments, monorepo migration, healthcare AI) x 5 configs (1/2/3/5 codex agents + claude+codex cross-model)
  • Evidence-backed changes: reduce default max_rounds 7→5, recommend cross-model in README, remove 3+ agents from roadmap
  • Includes PR discussion mode (--pr flag) from earlier commits on this branch
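The checklist-coverage numbers reported below can be produced by something as simple as keyword matching against the transcript. A minimal sketch of that idea (the item ids, patterns, and function name here are invented for illustration, not the actual `tests/n-agent-eval.js` scoring logic):

```javascript
// Illustrative checklist scorer: coverage = fraction of expert checklist
// items whose keyword patterns appear in the discussion transcript.
function scoreChecklist(transcript, checklist) {
  const text = transcript.toLowerCase();
  const hits = checklist.filter((item) =>
    item.patterns.some((p) => text.includes(p.toLowerCase()))
  );
  return {
    covered: hits.length,
    total: checklist.length,
    coverage: hits.length / checklist.length,
  };
}

// Invented fintech-payments items for demonstration:
const checklist = [
  { id: "idempotency", patterns: ["idempotency key", "duplicate charge"] },
  { id: "pci-scope", patterns: ["pci", "tokenization"] },
];
const result = scoreChecklist(
  "Use an idempotency key to avoid duplicate charges.",
  checklist
);
// result.coverage === 0.5 (1 of 2 items matched)
```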

Key Findings

| Config | Avg Checklist Coverage | Avg Traps Caught | Avg Duration | Avg Tokens |
|---|---|---|---|---|
| 1-codex (solo) | 90% | 57% | 314s | ~2K |
| 2-codex | 95% | 61% | 532s | ~6K |
| 3-codex (+synthesizer) | 97% | 61% | 500s | ~7K |
| 5-codex (full panel) | 92% | 78% | 1014s | ~13K |
| 2-cross (Claude+Codex) | 97% | 89% | 607s | ~7K |
  1. Cross-model wins: Claude+Codex (97% coverage, 89% traps) ties the best same-model config on coverage and beats every one on trap detection
  2. 5 agents is counterproductive: worse coverage (92%) at 2x cost — agents go deep on their role and lose breadth
  3. Debate helps vs solo: 90% → 95-97%, especially on cross-domain topics (healthcare)
  4. All discussions converge at round 3: max_rounds=7 was wasteful, reduced to 5
  5. LLM-as-judge has verbosity bias: ranked 5-codex 12-0 despite worse checklist scores — longer outputs get higher scores regardless of quality
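Finding 5 is worth pausing on: blind pairwise judging only hides which config produced a transcript, not how long it is. A sketch of the blind-judging mechanics (`judgeFn` stands in for the actual LLM call; none of these names come from the repo):

```javascript
// Blind pairwise comparison: randomize presentation order so the judge
// cannot tell which config produced which transcript, then map the
// verdict ("A" = first-shown transcript wins) back to the original pair.
function blindPairwise(a, b, judgeFn, rng = Math.random) {
  const swapped = rng() < 0.5;
  const [first, second] = swapped ? [b, a] : [a, b];
  const verdict = judgeFn(first.text, second.text); // "A" or "B"
  const firstWon = verdict === "A";
  const winner = firstWon !== swapped ? a : b;
  return winner.config;
}

const short = { config: "2-cross", text: "short, precise analysis" };
const long = { config: "5-codex", text: "much longer output..." };
// A judge that always prefers whichever transcript is shown first:
blindPairwise(short, long, () => "A", () => 0.9); // → "2-cross" (no swap)
blindPairwise(short, long, () => "A", () => 0.1); // → "5-codex" (swapped)
```

Order randomization controls position bias but does nothing for length bias, which is how a 12-0 pairwise result for 5-codex can coexist with worse checklist scores.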

What we tried and reverted

Synthesizer consensus template, longer research prompts, new contention table headers — before/after eval showed no measurable improvement (within run-to-run noise). Reverted to keep only evidence-backed changes.

Test plan

  • Dry run: `node tests/n-agent-eval.js --dry-run` shows the correct 15-run matrix
  • Smoke test: single topic × config completes successfully
  • Full matrix: 15 runs completed, scored, and reported
  • After-changes eval: confirmed template reverts were correct
  • Existing `headless-council.js` remains backward compatible (only the max_rounds default changed)

🤖 Generated with Claude Code

Restuta and others added 4 commits March 19, 2026 20:32
Automated evaluation framework for discuss-skill:
- tests/eval.js: runner that creates discussion files, runs the
  orchestrator, and validates output structure/consensus/frontmatter
- tests/cases/*.json: test case definitions with custom assertions
- Validates: frontmatter state, required sections, consensus format,
  lens application (research only, not turns)
- Reports pass/fail with round count and duration

Two test cases:
- basic-council: default lens produces valid consensus
- custom-lens: simplicity-vs-correctness lens applies correctly

Both pass (3 rounds each, ~4 min total).
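The structural validation described above might look roughly like this; the frontmatter field and section names are assumptions for illustration, not the shipped tests/eval.js:

```javascript
// Validate the structure of a concluded discussion file: frontmatter
// state plus required sections. Returns a list of errors (empty = valid).
function validateDiscussion(markdown) {
  const errors = [];
  if (!/^---\n[\s\S]*?state: concluded[\s\S]*?\n---/.test(markdown)) {
    errors.push("frontmatter missing state: concluded");
  }
  for (const section of ["## Consensus", "## Points of Contention"]) {
    if (!markdown.includes(section)) errors.push(`missing ${section}`);
  }
  return errors;
}

validateDiscussion(
  "---\nstate: concluded\n---\n## Consensus\n...\n## Points of Contention\n..."
); // → []
```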

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/discuss --pr 123 debates the design decisions in a pull request:
- Pulls PR context via gh CLI (title, body, diff)
- Generates discussion topic from PR, focused on design tradeoffs
- Default lens: simplicity-vs-correctness
- Posts consensus as PR comment when done
- Explicitly scoped to architecture, not code style

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…uality

Eval framework:
- N-agent orchestrator (headless-council-n.js) supporting 2-5 agents with roles
- 3 eval topics with expert checklists (13-15 items each) and trap detection
- Comparison runner (n-agent-eval.js) with checklist scoring + blind pairwise judge
- Ran 15 discussions: 5 configs x 3 topics (1/2/3/5-codex + claude+codex)

Key findings:
- Cross-model (Claude+Codex) wins: 97% coverage, 89% traps caught
- 3-agent with synthesizer matches cross-model coverage (97%)
- 5-agent is counterproductive: 92% coverage at 2x cost
- All discussions converge at round 3 regardless of agent count
- Research phase is highest leverage (solo agent scores 90%)

Changes based on evidence:
- Swap consensus to synthesizer template (neutral arbiter framing)
- Increase research target from ~200 to ~500 words
- Update contention table to "Who Had the Strongest Case & Why"
- Reduce default max_rounds from 7 to 5
- Recommend cross-model as default in README
- Remove 3+ participant panels from roadmap (data says counterproductive)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Before/after comparison (N=1 per cell) showed the synthesizer consensus
template, longer research target, and new contention table header produced
no measurable improvement (2-codex: 95%→93%, 2-cross: 97%→95%). Changes
are within run-to-run noise but trending negative, so reverting to keep
only evidence-backed changes.

Retained: max_rounds 7→5 (all 15 runs converged at round 3), roadmap
update removing 3+ agents (data shows counterproductive), cross-model
recommendation in README (data shows consistent advantage).

After-changes eval results added for comparison.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Restuta Restuta changed the title Eval framework + PR discussion mode N-agent eval framework: cross-model > multi-agent, 5 agents counterproductive Apr 1, 2026
Links the article that inspired the N-agent eval experiment, with our
own findings noted (3-agent matches cross-model, 5-agent counterproductive).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>