Skip to content

review: calibrate severity through reachability, consequence, and evidence#276

Merged
rgushchin merged 1 commit into
sashiko-dev:mainfrom
ftabba:tabba/severity_calibration
Jun 22, 2026
Merged

review: calibrate severity through reachability, consequence, and evidence#276
rgushchin merged 1 commit into
sashiko-dev:mainfrom
ftabba:tabba/severity_calibration

Conversation

@ftabba

@ftabba ftabba commented Jun 14, 2026

Copy link
Copy Markdown

Reworks the severity-calibration change in this PR per the review feedback. The earlier reachability/consequence/evidence form is dropped.

The Stage 10 verification prompt assigns severity in one step, which over-states it (Critical for a bad-sounding consequence even when the code is barely reachable or the trigger is only assumed). This adds a short calibration guide to severity.md and reduces the Stage 10 item to a pointer at it. The reviewer reasons through consequence, triggering path, and reachability, and lowers the level only on the grounds the guide names:

  • cosmetic, or code shown to be unreachable: at most Low.
  • no concrete triggering path: at most Medium, marked speculative but still reported, not dropped just for the missing path.
  • Critical or High needs a damaging consequence and a concrete path on code not shown to be unreachable.

Reachability is given as examples, not a fixed set to choose from, with an explicit "can't determine, so don't lower the level" escape (the forced-choice point you raised). Pure prompt change: the four labels, schema, and report format are unchanged.

Measurement (99-patch subset of the benchmark, same model). Two baseline runs with no calibration gave 48 and 54 exact detections; the updated rubric gives 52, so recall holds, and it emits the fewest findings of the set (359 vs about 368 at baseline) at equal recall. The full 999 is very token-heavy and out of reach here, so this is a subset.

@rgushchin

Copy link
Copy Markdown
Member

So tbh I've uneasy feelings about this change. I like the direction, but I'm also afraid that we might start missing critical bugs because sashiko is not smart enough to fully understand the reachability path. Let me run my full benchmark and see how it changes things.

@ftabba

ftabba commented Jun 15, 2026

Copy link
Copy Markdown
Author

Sure. If you'd like me to test or run anything as well I'm happy to help. I've been interacting a lot with Sashiko lately, both on-list and off-list.

@rgushchin

Copy link
Copy Markdown
Member

Unfortunately I see some regression in the full benchmark:
2026-06-16T17:13:38.482979Z INFO benchmark: Total Entries: 999
2026-06-16T17:13:38.482991Z INFO benchmark: Detected (Exact): 442
2026-06-16T17:13:38.483001Z INFO benchmark: Partially Detected: 68
2026-06-16T17:13:38.483028Z INFO benchmark: Missed: 476
2026-06-16T17:13:38.483045Z INFO benchmark: Not Reviewed/Found: 12
2026-06-16T17:13:38.483060Z INFO benchmark: Skipped (No Description): 1
2026-06-16T17:13:38.483073Z INFO benchmark: Total Concerns (Before Stage 8): 13602
2026-06-16T17:13:38.483086Z INFO benchmark: Total Findings (Final Report): 3945
2026-06-16T17:13:38.483102Z INFO benchmark: --- Performance Metrics (averages per reviewed patch) ---
2026-06-16T17:13:38.483118Z INFO benchmark: Avg Tokens In: 4263271
2026-06-16T17:13:38.483133Z INFO benchmark: Avg Tokens Out: 21018
2026-06-16T17:13:38.483147Z INFO benchmark: Avg Turns: 135.1
2026-06-16T17:13:38.483166Z INFO benchmark: Avg Time: 833s

vs older
2026-05-30T04:24:23.558990Z INFO benchmark: Benchmark Complete.
2026-05-30T04:24:23.559075Z INFO benchmark: Total Entries: 999
2026-05-30T04:24:23.559098Z INFO benchmark: Detected (Exact): 467
2026-05-30T04:24:23.559118Z INFO benchmark: Partially Detected: 57
2026-05-30T04:24:23.559145Z INFO benchmark: Missed: 462
2026-05-30T04:24:23.559163Z INFO benchmark: Not Reviewed/Found: 12
2026-05-30T04:24:23.559183Z INFO benchmark: Skipped (No Description): 1
2026-05-30T04:24:23.559200Z INFO benchmark: Total Concerns (Before Stage 8): 13643
2026-05-30T04:24:23.559221Z INFO benchmark: Total Findings (Final Report): 4041
2026-05-30T04:24:23.559243Z INFO benchmark: --- Performance Metrics (averages per reviewed patch) ---
2026-05-30T04:24:23.559261Z INFO benchmark: Avg Tokens In: 3868947
2026-05-30T04:24:23.559276Z INFO benchmark: Avg Tokens Out: 20746
2026-05-30T04:24:23.559294Z INFO benchmark: Avg Turns: 123.0
2026-05-30T04:24:23.559314Z INFO benchmark: Avg Time: 779s
2026-05-30T04:24:23.559328Z INFO benchmark: Detailed results written to benchmark_results.json

--

I wonder if a better approach is to have more targeted changes to the severity.md prompt? I like your prompt, but maybe it's a bit too narrowly focused on the kvm reality, so we should be a bit more careful. I like the "speculative" part though.

@ftabba ftabba force-pushed the tabba/severity_calibration branch from ae515d7 to 35e5338 Compare June 18, 2026 15:44
@ftabba

ftabba commented Jun 18, 2026

Copy link
Copy Markdown
Author

Thanks again for running the full benchmark and your feedback. I've reworked the rubric to address the two points you raised.

On your two points, both of which were right:

  • Too KVM-specific and forced-choice. The reachability axis is now subsystem-neutral and explicitly illustrative ("examples, not a fixed set"), with an explicit escape: if reachability can't be determined from the diff, say so and don't lower the level. Undetermined reachability no longer caps the level, only reachability you've actually ruled out does. That removes the pick-from-a-bad-set case where the model hallucinates.
  • More targeted, in severity.md. I moved the calibration into severity.md as you suggested and cut the Stage 10 prompt to a short pointer. The speculative bar you liked stays: with no concrete triggering path a finding is capped at Medium and marked speculative, but still reported, not dropped for the missing path.

Subset numbers (99-patch subset of the benchmark, same model). Two baseline runs with no calibration gave 48 and 54 exact detections; the updated rubric gives 52, so recall holds, and it emits the fewest findings of the set (359 vs about 368 at baseline) at equal recall. I couldn't run the full 999 myself: it's very token-heavy and even the subset barely fit in my quota.

Could you run the full benchmark on the updated branch? And if you can share which commits regressed in your run, I can test the updated rubric against those directly rather than a random subset.

Comment thread third_party/prompts/kernel/severity.md Outdated
not let you determine reachability, say so and do not lower the level for it.

Apply a cap only in the case it names:
- Cosmetic (style, naming, or a commit-message-only problem): at most Low.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is questionable actually: a decent commit message is a must for a patch being merged. In general I want low issues to be "nice to fix, not really a reason to not merge the change as it is". Also "at most" might suggest it should be dropped.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Removing this line — the existing Low/Medium definitions already handle style, and folding a commit-message problem into Low is wrong since a weak message can block a merge. And taking your point that "at most Low" reads as "drop": nothing in the rubric should drop a real finding, only rank it.

Comment thread third_party/prompts/kernel/severity.md Outdated

Apply a cap only in the case it names:
- Cosmetic (style, naming, or a commit-message-only problem): at most Low.
- Code you have shown cannot be reached: at most Low, however damaging it

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also a bit dangerous, IMO. I think we should start with causing LLM to determine reachability and see how it works in practice. My concern is that we might pessimise a lot of real bad issues when LLM is just not smart enough to understand the reachability. So maybe let's just bump issues for being externally/remotely reachable for start?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The key one, and I think you're right. Dropping the unreachable down-rank entirely. Reachability becomes an up-signal only: raise the level for a bug reachable by untrusted/remote/unprivileged input, and if reachability can't be established, leave it on consequence rather than lowering. That removes the "LLM guesses unreachable and buries a real bug" failure you're worried about, and matches "bump for externally/remotely reachable for a start."

Comment thread third_party/prompts/kernel/severity.md Outdated
- Speculative (no concrete triggering path): at most Medium.
- Critical or High requires a damaging consequence and a concrete,
non-speculative triggering path, on code you have not shown to be
unreachable. Unknown reachability does not by itself cap the level; only

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this sentence is somewhat contradictory.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, it was a double-negative tangle. It disappears with the reachability change — once reachability only raises the level, there's no "unknown vs ruled-out" caveat left to state.

@ftabba ftabba force-pushed the tabba/severity_calibration branch from 35e5338 to 4596740 Compare June 18, 2026 19:19
Comment thread third_party/prompts/kernel/severity.md Outdated
- Triggering path: lay out the concrete path that reaches the bug, naming the
preconditions a caller or input must satisfy. If you cannot, because it rests
on an unproven assumption or on an ABI, register, or convention you might be
misreading, still report the finding and mark it speculative. A speculative

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"A speculative finding is at most Medium" contradicts the next paragraph. Also, please add a "catch all other options" option for the consequence list, e.g. other.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The speculative cap was sitting right next to "do not lower / leave on consequence" and reading as a contradiction. I've scoped the never-lower rule to reachability and pulled the cap onto its own line as the single deliberate down-rank: it's the one place a level is capped (at Medium), because there the open question is whether the bug is real at all, not how severe it is. The finding is still always reported, never dropped, so this only moves a label, it can't drop anything from the report.

Added "other" as a catch-all to the consequence list too.

The Stage 10 verification prompt assigns a severity in one step, which
over-states it: a finding gets Critical for a bad-sounding consequence
even when the trigger is only assumed.

Add a short calibration guide to severity.md and reduce the Stage 10
item to a pointer at it. The reviewer reasons through consequence, the
concrete triggering path, and reachability:

  - a finding with no concrete triggering path is at most Medium and is
    marked speculative, but is always reported, never dropped,
  - reachability only raises the level: bump a bug reachable by
    untrusted, remote, or unprivileged input. Do not lower a finding for
    being or looking unreachable, since reachability is hard to establish
    from a diff and a wrong call buries a real bug.

The only lever that lowers a level is an unproven triggering path (the
speculative cap). Reachability and consequence can only raise it.

Pure prompt change. The four labels, output schema, and report format
are unchanged.

Signed-off-by: Fuad Tabba <tabba@google.com>
@ftabba ftabba force-pushed the tabba/severity_calibration branch from 4596740 to dab2c3a Compare June 19, 2026 07:30
@rgushchin rgushchin merged commit f9ac906 into sashiko-dev:main Jun 22, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants