Use this when you need to launch a fresh all-stages run-e2e, fix issues as they appear, and carry the run through to a truthful final state.
- Pick a fresh run_id.
- Launch a new all-stages run:
python -m cli.aisp bench run-e2e \
--run-id <RUN_ID> \
--run-full-sweep \
--run-fabric \
--cluster-preset common-answer-fast \
--validity-profile portable
- Monitor the run continuously with the repo-native status surface:
python -m cli.aisp bench run-e2e-status --run-id <RUN_ID> --watch
run-e2e auto-arms a detached watcher by default. Re-arm it manually if needed:
python -m cli.aisp bench watch-e2e --run-id <RUN_ID>
- The dashboard now exposes the same normalized source under /e2e?run_id=<RUN_ID> instead of reading raw package JSON directly.
- For each failure or suspicious weak result:
  - root-cause it
  - fix it in the most local correct place
  - rerun the affected target directly
  - resume run-e2e
- Compare every touched chapter or lab against the matching book-after/chXX* content.
- Keep capability-limited outcomes truthful: use skipped/partial, not fake green.
- Do not disable nsys or ncu.
- End with:
  - final top-level run status
  - per-stage outcomes
  - rerun ledger
  - exact verification commands
  - explanation of anything still partial or unresolved
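
The launch step above can be sketched in Python as a small helper that derives a fresh run id and assembles the documented run-e2e command. This is an illustrative sketch, not part of the repo: the helper names `fresh_run_id` and `launch_command` and the date-plus-suffix id scheme are assumptions (the scheme mirrors the example id used in this document); only the flags shown in the commands above are used.

```python
import shlex
from datetime import datetime, timezone


def fresh_run_id(suffix="e2e_full_all_fresh", now=None):
    """Build a date-prefixed run id (hypothetical naming scheme)."""
    now = now or datetime.now(timezone.utc)
    return f"{now:%Y%m%d}_{suffix}"


def launch_command(run_id):
    """Return the all-stages run-e2e invocation as an argv list."""
    return [
        "python", "-m", "cli.aisp", "bench", "run-e2e",
        "--run-id", run_id,
        "--run-full-sweep",
        "--run-fabric",
        "--cluster-preset", "common-answer-fast",
        "--validity-profile", "portable",
    ]


if __name__ == "__main__":
    # Print the command for review instead of executing it blindly.
    print(shlex.join(launch_command(fresh_run_id())))
```

Because the id is only date-based, change the suffix if a run package with the same id already exists, so the new run stays genuinely fresh.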
- Inspect repo state and current modifications before starting.
- Follow existing repo conventions before introducing new harness, launcher, or profiler patterns.
- Keep fixes local to the benchmark, chapter, lab, or cluster path unless the defect is clearly cross-cutting.
- Dogfood every changed runtime path with a real repo invocation.
- Record exact commands, run ids, and artifact paths while working.
- If the e2e run aborts, fix durability or resume behavior before restarting broad reruns.
- Prefer run-e2e-status over manual JSON/log joins. It normalizes live child progress, stale-orchestrator detection, watcher state, recent events, ledgers, and the exact resume command.
- run-e2e defaults to --full-sweep-suite-timeout 0 so that long full-sweep buckets are not killed by the 4-hour aggregate suite watchdog. Per-benchmark and profiler timeouts still apply.
- If a benchmark is unsupported on the current host, emit an explicit SKIPPED: result instead of degrading silently.
- If a speed-goal benchmark lands below 1.05x, ensure status semantics are correct and visible in structured outputs.
- For every touched chapter or lab, check the corresponding book-after/chXX* material and fix any meaningful mismatch.
- Preserve provenance packages and historical-failure ledgers; do not hide old failures by deleting artifacts.
- Finish with a truthful summary grouped into:
  - green
  - partial because of host capability limits
  - unresolved failures
  - book-after alignment follow-ups
- Launch a brand-new full run-e2e across all stages.
- Fix anything broken, sub-optimal, or misaligned with the corresponding book-after/chXX* content.
- Re-run targeted failures immediately after each fix.
- Finish with an evidence-backed summary of what is green, what is honestly partial because of host capability limits, and what remains unresolved.
- Review the current repo state and existing local modifications first.
- Do not revert or delete user changes.
- Treat existing modified files as part of the task unless they are clearly unrelated generated artifacts.
- Follow repo conventions before introducing new harness patterns, launcher paths, or workaround flows.
- Use a new run id, for example:
python -m cli.aisp bench run-e2e \
--run-id 20260327_e2e_full_all_fresh \
--run-full-sweep \
--run-fabric \
--cluster-preset common-answer-fast \
--validity-profile portable
- The run package records watcher metadata and status under:
  - artifacts/e2e_runs/<RUN_ID>/watcher_status.json
  - artifacts/e2e_runs/<RUN_ID>/<RUN_ID>_watcher.launch.log
- Use run-e2e-status for one normalized snapshot instead of manually diffing summary.json, checkpoint.json, progress.json, and child run artifacts.
- The raw run package now includes preferred_progress_source and actions fields so humans, MCP clients, and dashboard views can discover the authoritative status surface without reconstructing it manually.
- If the host is virtualized, single-GPU, or lacks IB / Spectrum-X management-plane coverage, keep results truthful:
  - do not force succeeded when the correct result is partial
  - capability-gated multi-GPU work must remain explicit skipped/partial
  - fabric should only be fully green when the underlying capability contract is truly satisfied
- Keep benchmark and profiler validity checks strict except where the repo’s portable profile explicitly allows compatibility mode.
- Never disable nsys or ncu.
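
As a quick sanity check on the watcher artifacts listed above, the two paths can be derived from the run id and tested for existence. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the paths are taken relative to the repo root, and the files are deliberately not parsed because their schema is not described here. run-e2e-status remains the preferred status surface.

```python
from pathlib import Path

# Root of e2e run packages, relative to the repo root (as listed above).
E2E_RUNS_ROOT = Path("artifacts/e2e_runs")


def watcher_artifacts(run_id, root=E2E_RUNS_ROOT):
    """Return the watcher status and launch-log paths for a run id."""
    run_dir = Path(root) / run_id
    return {
        "status": run_dir / "watcher_status.json",
        "launch_log": run_dir / f"{run_id}_watcher.launch.log",
    }


def missing_watcher_artifacts(run_id, root=E2E_RUNS_ROOT):
    """List the names of watcher artifacts that do not exist yet."""
    return [
        name
        for name, path in watcher_artifacts(run_id, root).items()
        if not path.exists()
    ]
```

If `missing_watcher_artifacts` reports a missing file, the watcher may not have been armed, and re-arming it with watch-e2e is the likely next step.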
- Monitor the run continuously with run-e2e-status.
- For any failed benchmark, failed profiler, broken resume behavior, missing artifact, bad classification, or suspiciously weak result:
  - root-cause it
  - fix it in the most local correct place
  - re-run the affected target with a realistic repo invocation
  - then resume or restart the e2e flow as appropriate
- If a benchmark is not broken but is clearly sub-optimal relative to chapter intent, inspect the matching book-after/chXX* material and align code, harness expectations, docs/snippets, or runtime semantics as needed.
- If a chapter’s code and book-after are intentionally different, call that out explicitly in the final summary.
- For each touched chapter or lab, compare against the matching book-after/chXX* content.
- Fix obvious mismatches in:
  - benchmark name or intent
  - optimization goal
  - capability gating
  - profiler-path behavior
  - code snippet, command, or artifact naming
  - performance story stated in the chapter
- If book and code disagree, fix the real source-of-truth mismatch instead of papering over it in the run summary.
- Preserve any existing e2e provenance packages and historical-failure ledgers.
- If the new run uncovers issues, produce the same kind of clear structured ledger or equivalent evidence.
- Do not delete old attempts just to make the current run package look clean.
- Dogfood every changed runtime path with a real repo invocation.
- Record exact commands and outcomes.
- Use apply_patch for manual edits.
- Prefer local fixes in benchmark/chapter code over weakening the harness.
- Keep structured outputs truthful and auditable.
- If the run aborts mid-flight, fix resume and durability behavior and continue.
- If a benchmark is unsupported on the current host, emit an explicit hard skip rather than a degraded fallback.
- If a speed-goal benchmark lands below 1.05x, ensure status semantics remain correct.
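
The last two rules can be illustrated with a small check that keeps an unsupported benchmark as an explicit skip and reports whether a measured speedup meets the 1.05x goal. This is only a sketch: the function name, the result fields, and the use of baseline and optimized timings are assumptions for illustration, and the harness's real status semantics are not reproduced here.

```python
SPEED_GOAL_MIN = 1.05  # threshold named in the rules above


def evaluate_speed_goal(supported, baseline_s, optimized_s):
    """Return a truthful, structured result for one speed-goal benchmark.

    Hypothetical result shape; not the harness's actual schema.
    """
    if not supported:
        # Explicit hard skip: never fall back to a degraded measurement.
        return {"skipped": True, "speedup": None, "meets_speed_goal": None}
    if baseline_s <= 0 or optimized_s <= 0:
        raise ValueError("timings must be positive")
    speedup = baseline_s / optimized_s
    return {
        "skipped": False,
        "speedup": speedup,
        "meets_speed_goal": speedup >= SPEED_GOAL_MIN,
    }
```

Keeping `meets_speed_goal` as its own field, separate from the skip flag, makes a weak result visible in structured output instead of letting it pass as green.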
At the end of the run, provide:
- Final top-level run id and terminal state.
- Exact per-stage outcome for tier1, full_sweep, cluster, and fabric.
- Every code or doc fix made.
- Every benchmark, lab, or chapter rerun and its outcome.
- A clear explanation of anything still partial and why.
- Verification commands and artifact paths.
- Any remaining debt, grouped into:
  - true failures
  - truthful capability-limited partials
  - book-after / code alignment follow-ups
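
One way to keep the final summary auditable is to collect outcomes as records and bucket them into the groups named in this document. The sketch below is illustrative only: the `Outcome` record, its fields, and the group keys are hypothetical names, not a format the repo defines.

```python
from dataclasses import dataclass

# Hypothetical group keys mirroring the summary groups in this document.
GROUPS = (
    "green",
    "capability_limited_partial",
    "true_failure",
    "book_after_followup",
)


@dataclass
class Outcome:
    target: str    # benchmark, lab, chapter, or stage that was run
    group: str     # one of GROUPS
    evidence: str  # exact command or artifact path backing the claim


def grouped_summary(outcomes):
    """Bucket outcomes by group, rejecting unknown group names."""
    summary = {group: [] for group in GROUPS}
    for outcome in outcomes:
        if outcome.group not in summary:
            raise ValueError(f"unknown group: {outcome.group!r}")
        summary[outcome.group].append(outcome)
    return summary
```

Requiring an `evidence` value on every record means each line of the summary can be traced to a command or artifact, and rejecting unknown groups avoids quietly inventing a softer category for a real failure.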
Do not stop at analysis. Execute the run, fix issues, and carry it through to a truthful final state.