N-agent eval framework: cross-model > multi-agent, 5 agents counterproductive#4
Open
Automated evaluation framework for discuss-skill:

- tests/eval.js: runner that creates discussion files, runs the orchestrator, and validates output structure/consensus/frontmatter
- tests/cases/*.json: test case definitions with custom assertions
- Validates: frontmatter state, required sections, consensus format, lens application (research only, not turns)
- Reports pass/fail with round count and duration

Two test cases:

- basic-council: default lens produces valid consensus
- custom-lens: simplicity-vs-correctness lens applies correctly

Both pass (3 rounds each, ~4 min total).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
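A minimal sketch of the kind of structural validation the runner performs on a finished discussion file. The function names (`parseFrontmatter`, `checkDiscussion`), the `state: consensus` frontmatter key, and the section headings are illustrative assumptions, not the real tests/eval.js API.

```javascript
// Parse a simple key: value YAML frontmatter block from markdown.
// Hypothetical helper; the real runner may use a YAML library.
function parseFrontmatter(markdown) {
  const match = markdown.match(/^---\n([\s\S]*?)\n---/);
  if (!match) return null;
  const fields = {};
  for (const line of match[1].split('\n')) {
    const idx = line.indexOf(':');
    if (idx > 0) fields[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return fields;
}

// Validate frontmatter state and required sections; section names are assumed.
function checkDiscussion(markdown) {
  const errors = [];
  const fm = parseFrontmatter(markdown);
  if (!fm) errors.push('missing frontmatter');
  else if (fm.state !== 'consensus') errors.push(`unexpected state: ${fm.state}`);
  for (const section of ['## Research', '## Consensus']) {
    if (!markdown.includes(section)) errors.push(`missing section: ${section}`);
  }
  return { pass: errors.length === 0, errors };
}
```

A per-case JSON file could then layer custom assertions on top of these baseline structural checks.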
/discuss --pr 123 debates the design decisions in a pull request:

- Pulls PR context via gh CLI (title, body, diff)
- Generates discussion topic from PR, focused on design tradeoffs
- Default lens: simplicity-vs-correctness
- Posts consensus as PR comment when done
- Explicitly scoped to architecture, not code style

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…uality

Eval framework:

- N-agent orchestrator (headless-council-n.js) supporting 2-5 agents with roles
- 3 eval topics with expert checklists (13-15 items each) and trap detection
- Comparison runner (n-agent-eval.js) with checklist scoring + blind pairwise judge
- Ran 15 discussions: 5 configs x 3 topics (1/2/3/5-codex + claude+codex)

Key findings:

- Cross-model (Claude+Codex) wins: 97% coverage, 89% traps caught
- 3-agent with synthesizer matches cross-model coverage (97%)
- 5-agent is counterproductive: 92% coverage at 2x cost
- All discussions converge at round 3 regardless of agent count
- Research phase is highest leverage (solo agent scores 90%)

Changes based on evidence:

- Swap consensus to synthesizer template (neutral arbiter framing)
- Increase research target from ~200 to ~500 words
- Update contention table to "Who Had the Strongest Case & Why"
- Reduce default max_rounds from 7 to 5
- Recommend cross-model as default in README
- Remove 3+ participant panels from roadmap (data says counterproductive)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
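The checklist-coverage metric behind these percentages can be sketched as below. Keyword matching is an assumption for illustration; the actual n-agent-eval.js judge may grade items with a model rather than string matching.

```javascript
// Score a discussion transcript against an expert checklist:
// coverage = fraction of checklist items the transcript addresses.
// Each item carries keywords that count as evidence it was covered (assumed schema).
function scoreChecklist(transcript, checklist) {
  const text = transcript.toLowerCase();
  const hits = checklist.filter((item) =>
    item.keywords.some((kw) => text.includes(kw.toLowerCase()))
  );
  return {
    covered: hits.length,
    total: checklist.length,
    coverage: hits.length / checklist.length,
  };
}
```

Running this per topic and averaging across the 3 topics would yield the per-config coverage numbers (97%, 92%, ...) compared above; the blind pairwise judge is a separate, model-graded step.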
Before/after comparison (N=1 per cell) showed the synthesizer consensus template, longer research target, and new contention table header produced no measurable improvement (2-codex: 95%→93%, 2-cross: 97%→95%). Changes are within run-to-run noise but trending negative, so reverting to keep only evidence-backed changes.

Retained:

- max_rounds 7→5 (all 15 runs converged at round 3)
- roadmap update removing 3+ agents (data shows counterproductive)
- cross-model recommendation in README (data shows consistent advantage)

After-changes eval results added for comparison.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
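The revert decision above amounts to comparing each before/after delta against a noise band. A minimal sketch, where the noise threshold is an assumption (the commit did not quantify it) and negative-trending within-noise changes were reverted by judgment call:

```javascript
// Classify a before/after coverage delta relative to run-to-run noise.
// `noise` (in percentage points) is an assumed threshold, not a measured one.
function classifyDelta(before, after, noise = 3) {
  const delta = after - before;
  if (delta > noise) return 'keep';
  if (delta < -noise) return 'revert';
  return 'within-noise';
}
```

With the reported cells (95→93 and 97→95), both deltas fall in the within-noise band; the commit resolved that ambiguity by reverting, since both trended negative and N=1 per cell gives no power to detect small gains.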
Links the article that inspired the N-agent eval experiment, with our own findings noted (3-agent matches cross-model, 5-agent counterproductive). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
(`--pr` flag) from earlier commits on this branch

Key Findings
What we tried and reverted
Synthesizer consensus template, longer research prompts, new contention table headers — before/after eval showed no measurable improvement (within run-to-run noise). Reverted to keep only evidence-backed changes.
Test plan
- `node tests/n-agent-eval.js --dry-run` shows correct 15-run matrix
- `headless-council.js` backward compatible (only max_rounds default changed)

🤖 Generated with Claude Code