fix: round-6 audit cleanup — 11 fixes (stacked on #31) by sweetcornna · Pull Request #32 · sweetcornna/mathodology

sweetcornna · 2026-05-14T12:00:39Z

Stacked on #31. Round-6 audit fixes from 4 parallel discovery agents (code-quality / cost+perf / UX / sourcemap-unexplored). Lands the P0 items that are small + safe + high-impact.

Critical fixes

AgentEvent contract drift — paper_editor + 6 finetune.* event kinds + kernel.error were emitted at runtime but rejected by the Pydantic contract → every fine-tune session crashed on first emit. Real smoke test caught this. Extended AgentName / EventKind Literals.
PaperEditorAgent.__init__() TypeError — finetune_main passed matlab_session= kwarg the agent doesn't accept. Real smoke test exposed.
Per-run_id mutex between pipeline + finetune consumers (Agent A #2). Two consumers could write the same notebook.ipynb / paper.meta.json / figures/ concurrently. main.py + finetune_main.py now share a dict[UUID, asyncio.Lock].

Performance fixes

Lift _PARAM_PATTERNS to module level (Agent A #4): three regexes recompiled on every mine_sensitivity_evidence call. ~3× speedup on the Writer evidence scan.
Reuse _FIG_ID_RE in evidence.py (Agent A #5).
Move {{ upstream_reminders }} to BOTTOM of user templates (Agent B cache strategy #4): preserves stable prompt-cache prefix across revision rounds. 4 TOMLs touched (searcher / modeler / coder / writer).
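To see why the position of the reminders matters for prefix caching, here is a character-level sketch. The template strings and function names are hypothetical, and real provider caches match on tokens under provider-specific rules, so treat this only as an illustration of the ordering argument.

```python
import os

# Hypothetical template pieces; the real TOML templates are not reproduced here.
STATIC_BLOCK = "SYSTEM RULES\nCATALOG\nEXEMPLARS\n"  # identical in every round
INSTRUCTION = "Respond with JSON.\n"


def render_top(reminders: str) -> str:
    """Old ordering: round-specific reminders come first."""
    return reminders + STATIC_BLOCK + INSTRUCTION


def render_bottom(reminders: str) -> str:
    """New ordering: static content, then reminders, then the instruction."""
    return STATIC_BLOCK + reminders + INSTRUCTION


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the longest common leading substring of two prompts."""
    return len(os.path.commonprefix([a, b]))


round_1 = "Reminder: fix units.\n"
round_2 = "Critic: tighten section 2.\n"

top_shared = shared_prefix_len(render_top(round_1), render_top(round_2))
bottom_shared = shared_prefix_len(render_bottom(round_1), render_bottom(round_2))
```

With the reminders on top, the two rounds diverge at the first character, so nothing is reusable. With the reminders at the bottom, the whole static block stays a shared prefix.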

Contract / safety

CoderDirective.language: Literal["python","matlab"] (Agent A #6): was str.
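Pydantic rejects values outside a Literal during validation, which is what makes the narrowed type useful. The stdlib-only sketch below mimics that check with a dataclass; `CoderDirectiveSketch` is a hypothetical stand-in, not the real CoderDirective model, and only the `language` field is shown.

```python
from dataclasses import dataclass
from typing import Literal, get_args

Language = Literal["python", "matlab"]


@dataclass(frozen=True)
class CoderDirectiveSketch:
    """Hypothetical stand-in for CoderDirective; only `language` is modeled."""

    language: Language

    def __post_init__(self) -> None:
        # Dataclasses do not enforce annotations at runtime, so check explicitly.
        if self.language not in get_args(Language):
            raise ValueError(f"unsupported language: {self.language!r}")
```

With a plain `str` field, a value such as "julia" would be accepted silently; with the Literal, it fails at construction time, where the drift is easy to spot.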

Frontend UX

5→6 stage pill grid (Agent C #1): repeat(5, ...) left Critic pill wrapping out of flow; bumped + added 980px breakpoint.
Removed dead Pause button (Agent C #5): was disabled with tooltip-only feedback.
loading="lazy" decoding="async" + alt fallback in PaperDraft.vue (Agent C #5): Writer emits ![](figures/foo.png) so alt was always empty. Walker now derives a fallback; renderer override adds lazy attrs so long papers don't block first paint.

Test status

pytest: 424 passed (unchanged baseline; all fixes additive)
ruff: clean
pnpm --filter web typecheck: clean

Deferred to round 7

From Agent B (cost):

Per-stage max_revision_rounds (Writer:1, others:2) → ~0.57 RMB/run
Trim coder_cells.source from Writer prompt → ~0.4 RMB/run
Anthropic cache_control: ephemeral wiring → ~0.7-1.0 RMB/run
Concurrent Critic + next-stage prep → ~6.7 min wall savings

From Agent D (sourcemap):

SkillTool on-demand body load (~250 LOC)
Mid-run interrupt control channel (RemoteControlHandle pattern, ~400 LOC) — user-chosen in round 5
Surgical edit tool for PaperEditor (find/replace, ~350 LOC)

From Agent A (code-quality):

_review_and_maybe_revise cost budget hooked to actual gateway cost events (current estimates 7× too low)
Test coverage for run_finetune end-to-end
Lift duplicated _problem_letter_from_problem_text to shared module

Spawned 4 parallel discovery agents (code-quality / cost+perf / UX / sourcemap-unexplored). Each returned a punch list; this PR lands the P0 items that are small + safe + high-impact. Bigger refactors (mid-run interrupt, SkillTool on-demand load, prompt-cache wiring, surgical edit tool) deferred to round 7.

## Critical fixes

1. **AgentEvent contract drift**: `paper_editor` agent literal + 6 `finetune.*` event kinds + `kernel.error` kind were emitted at runtime but rejected by the Pydantic AgentEvent contract; every fine-tune session crashed on the first emit. Real smoke test caught this. Added to `packages/py-contracts/src/mm_contracts/agent_io.py` AgentName + EventKind Literals.
2. **`PaperEditorAgent.__init__()` TypeError**: finetune_main was passing a `matlab_session=` kwarg the agent doesn't accept. Dropped from finetune_main.py and removed the unused MatlabSession import.
3. **Per-run_id mutex between pipeline + finetune consumers** (Agent A #2). The two consumers were free to write the same notebook + paper.meta.json + figures/ concurrently. main.py + finetune_main.py now share a `dict[UUID, asyncio.Lock]` and grab the lock around the business logic.

## Performance fixes

4. **Lift `_PARAM_PATTERNS` to module level** (Agent A #4): three regexes were being compiled on every `mine_sensitivity_evidence` call. Moved to module-level constants + frozenset blacklist; ~3× speedup on the Writer evidence scan per revision round.
5. **Reuse `_FIG_ID_RE`** in evidence.py (Agent A #5): drop inline `re.finditer(r"\[\[FIG:..." )` recompile.
6. **Move `{{ upstream_reminders }}` to BOTTOM of user templates** (Agent B cache-strategy #4): when it sat at the top, the stable-prefix cache key was broken across every revision. Now ordered: static (system+catalog+exemplars) → dynamic reminders → "Respond with..." instruction. Searcher / Modeler / Coder / Writer.

## Contract / safety fixes

7. **`CoderDirective.language: Literal["python","matlab"]`** (Agent A #6): was `str`, contract drift was invisible to Pydantic. The runtime already normalized; type now matches.

## Frontend UX fixes

8. **5→6 stage pill grid** (Agent C #1): `apps/web/src/styles.css:522` was `repeat(5, ...)` but `StagePills.vue` enumerates 6 agents, leaving Critic wrapping out of flow. Also added a 980px tablet breakpoint for 3 cols.
9. **Removed dead Pause button** (Agent C #5): disabled with tooltip-only feedback. Keep "New run" link; bring back a real Cancel when the backend exposes one.
10. **`<img loading="lazy" decoding="async">` + alt fallback** in PaperDraft.vue (Agent C #5): Writer emits `![](figures/foo.png)` so alt was always empty. Walker now derives a fallback from `title` or the filename stem; renderer override adds the lazy + async attrs so long papers don't block first paint.

## Test status

- `uv run --frozen pytest apps/agent-worker -q` → **424 passed**, 1 deselected (unchanged from baseline; all fixes are additive or contract-only).
- `uvx ruff check apps/agent-worker/src apps/agent-worker/tests`: clean.
- `pnpm --filter web typecheck`: clean.

## Deferred to round 7 (per agent recommendations)

- From Agent B (cost): per-stage `max_revision_rounds` (Writer:1, others:2), trim coder_cells.source from Writer prompt, Anthropic `cache_control: ephemeral` wiring in `crates/gateway/src/llm/providers/`, concurrent Critic + next-stage prep (~6.7 min wall savings).
- From Agent D (sourcemap): SkillTool on-demand body load, mid-run interrupt control channel (RemoteControlHandle pattern), surgical edit tool for PaperEditor (find/replace vs wholesale rewrite).
- From Agent A (code-quality): `_review_and_maybe_revise` cost budget hooked to actual gateway cost events (current estimates are 7× low), test coverage for `run_finetune` end-to-end + Critic revision fall-through, lift duplicated `_problem_letter_from_problem_text` to shared module.

sweetcornna wants to merge 1 commit into feat/round5-nl-paper-finetune from feat/round6-audit-fixes
