feat(agent): Add Primus Turbo optimization agent skills#338

Open
ChengYao-amd wants to merge 10 commits into main from dev/agent

Conversation

@ChengYao-amd
Collaborator

No description provided.

xiaobochen-amd and others added 10 commits May 21, 2026 09:44
* add knowledge & rules, add survey step

* fix gemm fp8 blockwise Llama-3.1-405B shape bug

* feat(agent): delete some terminate conditions and add tips distill (#285)
* update skill, modify acceptance standard for FORWARD+BACKWARD, limit sleep to at most 15 min in case the CLI stops by accident

* update performance trend format

* kernel-optimize: add quick baseline step to ENVIRONMENT_BASELINE

After representative_shapes are filled, run quick_command once against
them and save the output to rounds/round-1/artifacts/quick_baseline.log.
Later VALIDATE quick rounds can diff their own quick_validation.log
against this reference when metrics look off. Baseline record template
now documents the log path.
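
The diff step described above could be sketched as follows. This is a minimal illustration, not the skill's actual tooling: the function name `diff_quick_logs` is hypothetical, and the log line format in the usage note is an assumption, since the commit does not show what `quick_command` prints.

```python
import difflib


def diff_quick_logs(baseline_text: str, validation_text: str) -> list[str]:
    """Return unified-diff lines between the baseline log and a later quick log.

    Both arguments are the full text of the respective log files; an empty
    result means the two logs are identical line for line.
    """
    return list(
        difflib.unified_diff(
            baseline_text.splitlines(),
            validation_text.splitlines(),
            fromfile="quick_baseline.log",
            tofile="quick_validation.log",
            lineterm="",
        )
    )
```

For example, with a hypothetical one-line log such as `shape=4096x4096 tflops=100`, a later run reporting `tflops=90` produces one removed line and one added line, which points the reader at the shape whose metric drifted.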

* fix triton requirements

* update benchmark for consistency

* Add hard rule + skill: forbid benchmark-only caches in kernel optimization

Adds an always-applied hard rule and an operational skill that block
agents from landing wrapper-level caches whose hit rate depends on the
benchmark idiom (same `a` / `grad_out` Python object reused 100 times
inside a timing loop). Such caches inflate benchmark scores but produce
no gain in real LLM training, where activations and grad_out tensors are
fresh tensors each iteration.
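
To make the failure mode concrete, here is a small sketch in plain Python, with lists standing in for tensors. The class `IdKeyedCache` and the helper `hit_rate` are hypothetical names invented for this illustration; they are not code from this PR.

```python
class IdKeyedCache:
    """Hypothetical wrapper-level cache keyed on id(x): the forbidden pattern."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def preprocess(self, x):
        key = id(x)  # identity of the Python object, not its contents
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = [v * 2 for v in x]  # stand-in for an expensive preprocessing step
        self._store[key] = result
        return result


def hit_rate(inputs):
    """Feed every input through a fresh cache and return the fraction of hits."""
    cache = IdKeyedCache()
    for x in inputs:
        cache.preprocess(x)
    return cache.hits / (cache.hits + cache.misses)


# Benchmark idiom: the same object is reused 100 times inside the timing loop.
a = [1.0, 2.0]
benchmark_rate = hit_rate([a] * 100)

# Training idiom: a fresh object every iteration. The list keeps all of them
# alive so that every id is distinct for the duration of the loop.
training_rate = hit_rate([[float(i)] for i in range(100)])
```

Here `benchmark_rate` is 0.99 and `training_rate` is 0.0: the cache only pays off when the caller happens to pass the same object again, which is exactly what a timing loop does and a training step does not.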

New files:
- `agent/rules/no_benchmark_overfitting.mdc`: alwaysApply rule defining
  the forbidden patterns (F1: id(a)-keyed activation cache, F2:
  id(grad_out)-keyed cache, F3: id(scale_of_activation)-keyed cache,
  F4: any non-weight id(...)-keyed cache), allowed patterns (kernel
  fusion, ctx.save_for_backward, weight cache with bounded gain), and
  the required `Real-training transfer check` round summary section.
- `agent/skills/kernel-optimize/avoid-benchmark-overfit/SKILL.md`:
  operational checklist with a 6-step audit (bucket classification,
  id(...) audit, pen-and-paper hit-rate trace, weight-cache gain bound,
  required summary section, tips-file hygiene), worked example, and
  VALIDATE-time checklist.
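
One possible way to mechanize the `id(...)` audit step is a small AST scan that lists every call to the builtin `id` so each one can be classified by hand. This is a sketch under assumptions: the helper name `find_id_calls` is hypothetical, and the SKILL's actual audit procedure is not shown in this PR description.

```python
import ast


def find_id_calls(source: str) -> list[tuple[int, str]]:
    """Return (line number, argument source) for each call to the builtin id(...)."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "id"
            and node.args
        ):
            # ast.unparse (Python 3.9+) turns the argument back into source text.
            found.append((node.lineno, ast.unparse(node.args[0])))
    return sorted(found)
```

The scan only surfaces candidates. Deciding whether an argument is a weight (potentially allowed) or an activation, `grad_out`, or activation scale (F1 to F3) still requires reading the call site.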

Updated entry points so agents discover these from the existing flow:
- `agent/skills/kernel-optimize/SKILL.md`: knowledge reference table +
  pre-loop reading list now point at both files.
- `agent/skills/kernel-optimize/workflow/optimize-loop.md`: iteration
  contract section, OPTIMIZE phase, and VALIDATE hard gates now
  reference the rule and the audit, with `id(activation)` /
  `id(grad_out)` / `id(activation_scale)` caches as a hard reject.
- `agent/skills/kernel-optimize/triton/SKILL.md` and `examples.md`:
  start-here reading lists updated.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Extend transfer audit to SURVEY and REPORT phases of optimize loop

Previously avoid-benchmark-overfit was only consumed at OPTIMIZE / VALIDATE.
Direction-search and final-report stages had no equivalent gate, so a
benchmark-only direction could enter the campaign at SURVEY and still be
celebrated at REPORT even after VALIDATE rejected the worst offenders.

This change adds two new gates:

- SURVEY: related-work-template now carries a Real-training Transfer Audit
  table that tags every shortlisted direction with a K1-K4 / W1 / W2 / W3
  bucket, and the Initial Hypothesis Shortlist must be filtered to K1-K4
  plus bounded W1 only. The kernel-optimize SKILL spells this out, and
  avoid-benchmark-overfit gets a Step 0 SURVEY-time direction filter and
  a SURVEY checklist.

- REPORT: optimize-loop's REPORT phase now requires a Real-training
  applicability audit table that re-attributes baseline -> final best
  delta into structural / bounded / benchmark-only components. Final
  report cannot ship if any accepted round still has decision
  REJECT-as-overfit, or if the inflation gap (headline minus real-training
  equivalent) exceeds 1%. avoid-benchmark-overfit gets a Step 7
  REPORT-time re-attribution procedure and a REPORT checklist.
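
The REPORT-time ship gate can be sketched as a single predicate. The function name, the parameter names, and the `"ACCEPT"` label in the usage note are hypothetical. The sketch also assumes that both gains are expressed in percent and that the 1% threshold applies to their difference in percentage points, which the commit message does not spell out.

```python
def report_can_ship(
    headline_gain_pct: float,
    real_training_gain_pct: float,
    accepted_round_decisions: list[str],
    max_inflation_gap_pct: float = 1.0,
) -> bool:
    """Return True only if the final report passes both REPORT gates."""
    # Gate 1: no accepted round may still carry an overfit rejection.
    if "REJECT-as-overfit" in accepted_round_decisions:
        return False
    # Gate 2: inflation gap = headline minus real-training equivalent.
    inflation_gap_pct = headline_gain_pct - real_training_gain_pct
    return inflation_gap_pct <= max_inflation_gap_pct
```

For instance, `report_can_ship(12.0, 11.5, ["ACCEPT"])` passes with a gap of 0.5, while `report_can_ship(12.0, 10.0, ["ACCEPT"])` fails because the gap of 2.0 exceeds the threshold.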

ANALYZE necessarily inherits the same buckets, so candidate directions
must answer a Real-training transfer assessment question before being
promoted to the round's primary hypothesis. W2 / W3 directions are no
longer eligible for promotion under any aggregate-score argument.
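
The bucket filter that SURVEY and ANALYZE share might look like the sketch below. The `Direction` record, its `bounded` flag, and the example direction names are assumptions made for illustration. Only the bucket labels (K1-K4, W1, W2, W3) and the rule "K1-K4 plus bounded W1 only" come from the commit message.

```python
from dataclasses import dataclass

KERNEL_BUCKETS = {"K1", "K2", "K3", "K4"}


@dataclass
class Direction:
    name: str
    bucket: str            # one of K1-K4, W1, W2, W3
    bounded: bool = False  # only meaningful for W1: is the gain bound documented?


def eligible(direction: Direction) -> bool:
    """A direction may be shortlisted or promoted only if K1-K4 or bounded W1."""
    if direction.bucket in KERNEL_BUCKETS:
        return True
    return direction.bucket == "W1" and direction.bounded


def filter_shortlist(directions: list[Direction]) -> list[Direction]:
    """Drop W2 / W3 and unbounded W1 directions, preserving order."""
    return [d for d in directions if eligible(d)]


shortlist = filter_shortlist(
    [
        Direction("fuse-epilogue", "K2"),
        Direction("weight-layout-cache", "W1", bounded=True),
        Direction("activation-id-cache", "W2"),
    ]
)
kept_names = [d.name for d in shortlist]
```

Because eligibility depends only on the bucket tag, no aggregate benchmark score is even an input to the decision, which matches the rule that W2 / W3 directions cannot be promoted under any aggregate-score argument.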

Net effect: an agent following this loop can no longer accidentally
spend a campaign chasing a +X% number that disappears in real LLM
training; the same bucket tag follows a direction from survey to final
report.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>


2 participants