abrichr commented on Jan 17, 2026

Summary

Complete rewrite of /docs/publication-roadmap.md from the perspective of a skeptical reviewer at a top venue (NeurIPS, ICML, CHI). The goal is a paper that could be accepted, not merely submitted.

Key Changes

Honest Evidence Assessment

  • Acknowledges that all 45 macOS tasks share the SAME first action (clicking the Apple menu)
  • Notes that the WAA baseline run was interrupted (only 1 of 8 tasks ran, due to agent bugs)
  • Frames the results honestly as "trajectory-conditioned disambiguation of UI affordances"

Contribution Clarity

  • Demo-conditioning is a prompting strategy, not a new architecture (see the sketch after this list)
  • Explicitly positions the work as an empirical study, not an algorithmic contribution
  • Lists what this is NOT: a new model, training method, benchmark, or theoretical contribution
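
For concreteness, a minimal sketch of what demo-conditioning looks like as a prompting strategy. The trajectory format, field names, and prompt wording below are illustrative assumptions, not the actual implementation:

```python
# Illustrative sketch only: demo-conditioning as prompt construction,
# not an architectural change. All field names here are hypothetical.

def build_demo_conditioned_prompt(task: str, demo_steps: list[dict]) -> str:
    """Prepend a recorded human demonstration trajectory to the task prompt."""
    demo_lines = [
        f"Step {i + 1}: {step['action']} on '{step['target']}'"
        for i, step in enumerate(demo_steps)
    ]
    return (
        "You are a GUI agent. A human previously performed this task:\n"
        + "\n".join(demo_lines)
        + f"\n\nNow perform the task yourself: {task}\n"
        + "Respond with the next action to take."
    )

demo = [
    {"action": "click", "target": "Apple menu"},
    {"action": "click", "target": "System Settings"},
]
print(build_demo_conditioned_prompt("Open System Settings", demo))
```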

Statistical Rigor

  • Requires McNemar's test for paired comparisons
  • Bootstrap confidence intervals for all metrics
  • Effect size (Cohen's h) alongside p-values
  • Minimum sample size calculations (n >= 39 for a 20pp effect at 80% power; see the sketch after this list)
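
A minimal sketch of this pipeline in Python (NumPy/SciPy), assuming paired per-task success indicators for the baseline and demo-conditioned conditions; the data below is a toy placeholder, not real results:

```python
# Sketch of the required statistics on paired per-task outcomes
# (1 = success, 0 = failure). Toy placeholder data, not real results.
import numpy as np
from scipy.stats import binomtest, norm

baseline = np.array([0, 1, 0, 0, 1, 0, 1, 0, 0, 1])
demo     = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])

# McNemar's exact test: only discordant pairs carry information.
b = int(np.sum((baseline == 1) & (demo == 0)))  # baseline-only successes
c = int(np.sum((baseline == 0) & (demo == 1)))  # demo-only successes
p_mcnemar = binomtest(b, b + c, 0.5).pvalue

# Cohen's h: difference of arcsine-transformed success rates.
p1, p2 = baseline.mean(), demo.mean()
cohens_h = 2 * np.arcsin(np.sqrt(p2)) - 2 * np.arcsin(np.sqrt(p1))

# Bootstrap 95% CI on the paired difference in success rate.
rng = np.random.default_rng(0)
n = len(demo)
diffs = [
    float((demo[idx] - baseline[idx]).mean())
    for idx in (rng.integers(0, n, n) for _ in range(10_000))
]
ci_lo, ci_hi = np.percentile(diffs, [2.5, 97.5])

# Rough per-arm sample size for a 20pp improvement (normal approximation,
# two-sided alpha = 0.05, power = 0.80, baseline rate assumed 50%). The
# exact figure (the roadmap's n >= 39) depends on the assumed baseline
# rate and the sidedness of the test.
h20 = 2 * np.arcsin(np.sqrt(0.70)) - 2 * np.arcsin(np.sqrt(0.50))
n_required = ((norm.ppf(0.975) + norm.ppf(0.80)) / h20) ** 2

print(f"McNemar p={p_mcnemar:.3f}, Cohen's h={cohens_h:.2f}, "
      f"95% CI=[{ci_lo:.2f}, {ci_hi:.2f}], n per arm ~= {n_required:.0f}")
```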

Experiment Design

  • Minimum viable: 20 tasks x 2 conditions x 2 models x 3 trials = 240 runs (see the grid sketch after this list)
  • Full paper: ~1500 runs across WAA, WebArena, and ablations
  • Essential ablations: demo format, demo relevance, and k values
  • Required baselines: zero-shot, CoT, text-only few-shot, SOTA, and random
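
A sketch of how the minimum viable grid enumerates to 240 runs; condition and model names are placeholders:

```python
# Sketch of the minimum viable experiment grid; all names are placeholders.
from itertools import product

tasks = [f"task_{i:02d}" for i in range(20)]
conditions = ["zero_shot", "demo_conditioned"]
models = ["model_a", "model_b"]  # e.g., two frontier VLMs
trials = range(3)                # repeated trials per cell

runs = list(product(tasks, conditions, models, trials))
assert len(runs) == 240          # 20 x 2 x 2 x 3

for task, condition, model, trial in runs[:3]:
    print(task, condition, model, trial)
```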

Weakness Analysis

  • Anticipates reviewer criticisms with severity ratings
  • Identifies what we CANNOT fix (novelty, benchmark saturation)
  • Identifies what we CAN fix (benchmark coverage, multi-model, statistics)

Venue Fit

  • Realistic acceptance odds: NeurIPS main track <20%, workshop 60-70%
  • Recommends a workshop paper first (8 weeks), then a CHI/UIST full paper (6 months)
  • Pursue the main track only IF WAA shows a >30pp improvement

Risk Mitigation

  • Pivot strategies if results disappoint
  • Negative results paper option
  • Reviewer response templates

Document Structure

  1. Current State of Evidence
  2. Honest Contribution Assessment
  3. Weakness Analysis
  4. Required Experiments for Defensible Claims
  5. Statistical Rigor Requirements
  6. Related Work Gap Analysis (18 essential citations across GUI agents, PbD, VLMs, and RAG)
  7. Venue Fit Analysis
  8. Realistic Timeline
  9. Risk Mitigation
  10. Action Items

Appendices

  • A: Honest framing (abstract template, title options)
  • B: Cost estimates (~$200-400 for API calls, ~$25 for compute)
  • C: Reviewer response templates

Test plan

  • Markdown renders correctly
  • All internal links work
  • Tables format properly
  • Team review for accuracy of the current-state description

Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
abrichr merged commit 37170ee into main on Jan 17, 2026
6 checks passed
abrichr deleted the feature/rigorous-publication-roadmap branch on January 17, 2026 at 05:33