fix: round-6 audit cleanup — 11 fixes (stacked on #31)#32

Open
sweetcornna wants to merge 1 commit into feat/round5-nl-paper-finetune from feat/round6-audit-fixes
Conversation

@sweetcornna
Owner

Stacked on #31. Round-6 audit fixes from 4 parallel discovery agents (code-quality / cost+perf / UX / sourcemap-unexplored). Lands the P0 items that are small + safe + high-impact.

Critical fixes

  1. AgentEvent contract drift: the paper_editor agent name + 6 finetune.* event kinds + kernel.error were emitted at runtime but rejected by the Pydantic contract, so every fine-tune session crashed on first emit. A real smoke test caught this. Extended the AgentName / EventKind Literals.

  2. PaperEditorAgent.__init__() TypeError: finetune_main passed a matlab_session= kwarg the agent doesn't accept. A real smoke test exposed this.

  3. Per-run_id mutex between pipeline + finetune consumers (Agent A #2). Two consumers could write the same notebook.ipynb / paper.meta.json / figures/ concurrently. main.py + finetune_main.py now share a dict[UUID, asyncio.Lock].
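
The contract drift in item 1 is the kind of mismatch a small check can surface before a smoke test does. The following is a minimal sketch using only the standard library rather than Pydantic. Only `paper_editor` and `kernel.error` come from this PR; the other names, `EMITTED_KINDS`, and `find_contract_drift` are hypothetical placeholders.

```python
from typing import Literal, get_args

# Hypothetical contract. Only "paper_editor" and "kernel.error" are named in the PR;
# the remaining values are illustrative placeholders.
AgentName = Literal["writer", "paper_editor"]
EventKind = Literal["stage.start", "kernel.error", "finetune.start"]


def find_contract_drift(emitted: set[str], contract) -> set[str]:
    """Return values emitted at runtime that the Literal contract would reject."""
    return emitted - set(get_args(contract))


# Kinds the runtime emits (illustrative list, not taken from the codebase).
EMITTED_KINDS = {"stage.start", "kernel.error", "finetune.start", "finetune.done"}

missing = find_contract_drift(EMITTED_KINDS, EventKind)
print(missing)  # the kinds that still need to be added to the Literal
```

A check like this could run as a unit test, so a newly emitted kind fails CI instead of crashing a session.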

Performance fixes

  1. Lift _PARAM_PATTERNS to module level (Agent A #4): three regexes recompiled on every mine_sensitivity_evidence call. ~3× speedup on the Writer evidence scan.

  2. Reuse _FIG_ID_RE in evidence.py (Agent A #5).

  3. Move {{ upstream_reminders }} to BOTTOM of user templates (Agent B cache strategy #4): preserves stable prompt-cache prefix across revision rounds. 4 TOMLs touched (searcher / modeler / coder / writer).
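
For item 1, the shape of the change can be sketched as follows. This is an illustration only: the three patterns, the blacklist contents, and the helper `mine_param_names` are assumptions made up for the example, not the contents of the real `mine_sensitivity_evidence`.

```python
import re

# Compiled once at import time instead of on every call.
# The patterns themselves are illustrative placeholders.
_PARAM_PATTERNS = (
    re.compile(r"\b([a-z_]+)\s*=\s*(-?\d+(?:\.\d+)?)"),
    re.compile(r"\bvary(?:ing)?\s+([a-z_]+)"),
    re.compile(r"\bsensitivity\s+(?:to|of)\s+([a-z_]+)"),
)
# Hypothetical blacklist of names that are never model parameters.
_PARAM_BLACKLIST = frozenset({"fig", "eq", "table"})


def mine_param_names(text: str) -> list[str]:
    """Collect candidate parameter names in first-seen order, skipping blacklisted ones."""
    seen: dict[str, None] = {}
    lowered = text.lower()
    for pattern in _PARAM_PATTERNS:
        for match in pattern.finditer(lowered):
            name = match.group(1)
            if name not in _PARAM_BLACKLIST:
                seen.setdefault(name, None)
    return list(seen)
```

The function body only iterates over precompiled pattern objects, so no pattern strings are parsed on the hot path.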

Contract / safety

  1. CoderDirective.language: Literal["python","matlab"] (Agent A #6): was str.

Frontend UX

  1. 5→6 stage pill grid (Agent C #1): repeat(5, ...) left the Critic pill wrapping out of flow; bumped to 6 + added a 980px breakpoint.

  2. Removed dead Pause button (Agent C #5): was disabled with tooltip-only feedback.

  3. loading="lazy" decoding="async" + alt fallback in PaperDraft.vue (Agent C #5): Writer emits ![](figures/foo.png) so alt was always empty. Walker now derives a fallback; renderer override adds lazy attrs so long papers don't block first paint.

Test status

  • pytest: 424 passed (unchanged baseline; all fixes additive)
  • ruff: clean
  • pnpm --filter web typecheck: clean

Deferred to round 7

From Agent B (cost):

  • Per-stage max_revision_rounds (Writer:1, others:2) → ~0.57 RMB/run
  • Trim coder_cells.source from Writer prompt → ~0.4 RMB/run
  • Anthropic cache_control: ephemeral wiring → ~0.7-1.0 RMB/run
  • Concurrent Critic + next-stage prep → ~6.7 min wall savings

From Agent D (sourcemap):

  • SkillTool on-demand body load (~250 LOC)
  • Mid-run interrupt control channel (RemoteControlHandle pattern, ~400 LOC) — user-chosen in round 5
  • Surgical edit tool for PaperEditor (find/replace, ~350 LOC)

From Agent A (code-quality):

  • _review_and_maybe_revise cost budget hooked to actual gateway cost events (current estimates 7× too low)
  • Test coverage for run_finetune end-to-end
  • Lift duplicated _problem_letter_from_problem_text to shared module

Spawned 4 parallel discovery agents (code-quality / cost+perf / UX /
sourcemap-unexplored). Each returned a punch list; this PR lands the
P0 items that are small + safe + high-impact. Bigger refactors (mid-run
interrupt, SkillTool on-demand load, prompt-cache wiring, surgical edit
tool) deferred to round 7.

## Critical fixes

1. **AgentEvent contract drift** — `paper_editor` agent literal + 6
   `finetune.*` event kinds + `kernel.error` kind were emitted at
   runtime but rejected by the Pydantic AgentEvent contract; every
   fine-tune session crashed on the first emit. Real smoke test caught
   this. Added to packages/py-contracts/src/mm_contracts/agent_io.py
   AgentName + EventKind Literals.

2. **`PaperEditorAgent.__init__()` TypeError** — finetune_main was
   passing a `matlab_session=` kwarg the agent doesn't accept. Dropped
   from finetune_main.py and removed the unused MatlabSession import.

3. **Per-run_id mutex between pipeline + finetune consumers** (Agent A
   #2). The two consumers were free to write the same notebook +
   paper.meta.json + figures/ concurrently. main.py + finetune_main.py
   now share a `dict[UUID, asyncio.Lock]` and grab the lock around the
   business logic.
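
The shared-lock arrangement in item 3 can be sketched as below. This is a minimal illustration under assumptions: `_RUN_LOCKS`, `lock_for`, `consume`, and `demo` are hypothetical names, and the real consumers in main.py and finetune_main.py may structure this differently.

```python
import asyncio
from uuid import UUID, uuid4

# One lock per run_id, created lazily and shared by both consumers.
_RUN_LOCKS: dict[UUID, asyncio.Lock] = {}


def lock_for(run_id: UUID) -> asyncio.Lock:
    """Return the lock for run_id, creating it on first use.

    There is no await between the lookup and the insert, so two coroutines
    on the same event loop cannot race to create different locks.
    """
    lock = _RUN_LOCKS.get(run_id)
    if lock is None:
        lock = _RUN_LOCKS[run_id] = asyncio.Lock()
    return lock


async def consume(name: str, run_id: UUID, log: list[str]) -> None:
    """Stand-in for a consumer's business logic, serialized per run_id."""
    async with lock_for(run_id):
        log.append(f"{name}:start")
        await asyncio.sleep(0)  # yield to the loop, simulating file I/O
        log.append(f"{name}:end")


async def demo() -> list[str]:
    run_id = uuid4()
    log: list[str] = []
    await asyncio.gather(
        consume("pipeline", run_id, log),
        consume("finetune", run_id, log),
    )
    return log
```

Running `asyncio.run(demo())` shows each consumer's start/end pair staying contiguous. Note that this sketch never evicts finished runs from the dict, and it only protects consumers that share one process and event loop.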

## Performance fixes

4. **Lift `_PARAM_PATTERNS` to module level** (Agent A #4): three
   regexes were being compiled on every `mine_sensitivity_evidence`
   call. Moved to module-level constants + frozenset blacklist; ~3×
   speedup on the Writer evidence scan per revision round.

5. **Reuse `_FIG_ID_RE`** in evidence.py (Agent A #5): drop inline
   `re.finditer(r"\[\[FIG:..." )` recompile.

6. **Move `{{ upstream_reminders }}` to BOTTOM of user templates**
   (Agent B cache-strategy #4): when it sat at the top, the
   stable-prefix cache key was broken across every revision. Now
   ordered: static (system+catalog+exemplars) → dynamic reminders →
   "Respond with..." instruction. Searcher / Modeler / Coder / Writer.
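
The effect of the reordering in item 6 can be illustrated with a toy template. The sketch assumes a cache that matches on the leading characters of the prompt; the block contents and helper names are hypothetical, not taken from the TOML templates.

```python
import os.path

# Stand-ins for the static part of a user template and its closing instruction.
STATIC_BLOCK = "SYSTEM\nCATALOG\nEXEMPLARS\n"
INSTRUCTION = "Respond with JSON.\n"


def render_top(reminders: str) -> str:
    """Old ordering: dynamic reminders first."""
    return reminders + STATIC_BLOCK + INSTRUCTION


def render_bottom(reminders: str) -> str:
    """New ordering: static block, then reminders, then the instruction."""
    return STATIC_BLOCK + reminders + INSTRUCTION


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the longest common leading substring of a and b."""
    return len(os.path.commonprefix([a, b]))


round_1 = "Fix units.\n"
round_2 = "Cite sources.\n"
print(shared_prefix_len(render_top(round_1), render_top(round_2)))
print(shared_prefix_len(render_bottom(round_1), render_bottom(round_2)))
```

With reminders at the top, the two revision rounds share no leading text; with reminders at the bottom, the whole static block is shared.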

## Contract / safety fixes

7. **`CoderDirective.language: Literal["python","matlab"]`** (Agent A
   #6): was `str`, so contract drift was invisible to Pydantic. The
   runtime already normalized the value; the type now matches.
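
A standard-library sketch of what a narrowed field buys, assuming the runtime normalization is a trim plus lower-case (the PR does not say how it normalizes). `Language` and `normalize_language` are hypothetical names, and Pydantic would do the rejection step itself once the field is a `Literal`.

```python
from typing import Literal, cast, get_args

Language = Literal["python", "matlab"]


def normalize_language(raw: str) -> Language:
    """Trim and lower-case raw, rejecting anything outside the Literal."""
    value = raw.strip().lower()
    if value not in get_args(Language):
        raise ValueError(f"unsupported language: {raw!r}")
    return cast(Language, value)
```

For example, `normalize_language(" MATLAB ")` returns `"matlab"`, while `normalize_language("julia")` raises `ValueError` instead of passing through silently as a plain `str` would.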

## Frontend UX fixes

8. **5→6 stage pill grid** (Agent C #1): `apps/web/src/styles.css:522`
   was `repeat(5, ...)` but `StagePills.vue` enumerates 6 agents,
   leaving Critic wrapping out of flow. Also added a 980px tablet
   breakpoint for 3 cols.

9. **Removed dead Pause button** (Agent C #5): disabled with
   tooltip-only feedback. Keep "New run" link; bring back a real
   Cancel when the backend exposes one.

10. **`<img loading="lazy" decoding="async">` + alt fallback** in
    PaperDraft.vue (Agent C #5): Writer emits `![](figures/foo.png)`
    so alt was always empty. Walker now derives a fallback from
    `title` or the filename stem; renderer override adds the lazy +
    async attrs so long papers don't block first paint.

## Test status

- `uv run --frozen pytest apps/agent-worker -q` → **424 passed**, 1
  deselected (unchanged from baseline; all fixes are additive or
  contract-only).
- `uvx ruff check apps/agent-worker/src apps/agent-worker/tests` —
  clean.
- `pnpm --filter web typecheck` — clean.

## Deferred to round 7 (per agent recommendations)

From Agent B (cost): per-stage `max_revision_rounds` (Writer:1, others:2),
trim coder_cells.source from Writer prompt, Anthropic
`cache_control: ephemeral` wiring in `crates/gateway/src/llm/providers/`,
concurrent Critic + next-stage prep (~6.7 min wall savings).

From Agent D (sourcemap): SkillTool on-demand body load, mid-run
interrupt control channel (RemoteControlHandle pattern), surgical edit
tool for PaperEditor (find/replace vs wholesale rewrite).

From Agent A (code-quality): `_review_and_maybe_revise` cost budget
hooked to actual gateway cost events (current estimates are 7× too low),
test coverage for `run_finetune` end-to-end + Critic revision
fall-through, lift duplicated `_problem_letter_from_problem_text` to
shared module.