review: calibrate severity through reachability, consequence, and evidence by ftabba · Pull Request #276 · sashiko-dev/sashiko

ftabba · 2026-06-14T16:02:08Z

Reworks the severity-calibration change in this PR per the review feedback. The earlier reachability/consequence/evidence form is dropped.

The Stage 10 verification prompt assigns severity in one step, which over-states it (Critical for a bad-sounding consequence even when the code is barely reachable or the trigger is only assumed). This adds a short calibration guide to severity.md and reduces the Stage 10 item to a pointer at it. The reviewer reasons through consequence, triggering path, and reachability, and lowers the level only on the grounds the guide names:

cosmetic, or code shown to be unreachable: at most Low.
no concrete triggering path: at most Medium, marked speculative but still reported, not dropped just for the missing path.
Critical or High needs a damaging consequence and a concrete path on code not shown to be unreachable.

Reachability is given as examples, not a fixed set to choose from, with an explicit "can't determine, so don't lower the level" escape (the forced-choice point you raised). Pure prompt change: the four labels, schema, and report format are unchanged.

Measurement (99-patch subset of the benchmark, same model). Two baseline runs with no calibration gave 48 and 54 exact detections; the updated rubric gives 52, so recall holds, and it emits the fewest findings of the set (359 vs about 368 at baseline) at equal recall. The full 999 is very token-heavy and out of reach here, so this is a subset.

rgushchin · 2026-06-15T18:30:47Z

So tbh I've uneasy feelings about this change. I like the direction, but I'm also afraid that we might start missing critical bugs because sashiko is not smart enough to fully understand the reachability path. Let me run my full benchmark and see how it changes things.

ftabba · 2026-06-15T19:40:38Z

Sure. If you'd like me to test or run anything as well I'm happy to help. I've been interacting a lot with Sashiko lately, both on-list and off-list.

rgushchin · 2026-06-16T18:44:41Z

Unfortunately I see some regression in the full benchmark:
2026-06-16T17:13:38.482979Z INFO benchmark: Total Entries: 999
2026-06-16T17:13:38.482991Z INFO benchmark: Detected (Exact): 442
2026-06-16T17:13:38.483001Z INFO benchmark: Partially Detected: 68
2026-06-16T17:13:38.483028Z INFO benchmark: Missed: 476
2026-06-16T17:13:38.483045Z INFO benchmark: Not Reviewed/Found: 12
2026-06-16T17:13:38.483060Z INFO benchmark: Skipped (No Description): 1
2026-06-16T17:13:38.483073Z INFO benchmark: Total Concerns (Before Stage 8): 13602
2026-06-16T17:13:38.483086Z INFO benchmark: Total Findings (Final Report): 3945
2026-06-16T17:13:38.483102Z INFO benchmark: --- Performance Metrics (averages per reviewed patch) ---
2026-06-16T17:13:38.483118Z INFO benchmark: Avg Tokens In: 4263271
2026-06-16T17:13:38.483133Z INFO benchmark: Avg Tokens Out: 21018
2026-06-16T17:13:38.483147Z INFO benchmark: Avg Turns: 135.1
2026-06-16T17:13:38.483166Z INFO benchmark: Avg Time: 833s

vs older
2026-05-30T04:24:23.558990Z INFO benchmark: Benchmark Complete.
2026-05-30T04:24:23.559075Z INFO benchmark: Total Entries: 999
2026-05-30T04:24:23.559098Z INFO benchmark: Detected (Exact): 467
2026-05-30T04:24:23.559118Z INFO benchmark: Partially Detected: 57
2026-05-30T04:24:23.559145Z INFO benchmark: Missed: 462
2026-05-30T04:24:23.559163Z INFO benchmark: Not Reviewed/Found: 12
2026-05-30T04:24:23.559183Z INFO benchmark: Skipped (No Description): 1
2026-05-30T04:24:23.559200Z INFO benchmark: Total Concerns (Before Stage 8): 13643
2026-05-30T04:24:23.559221Z INFO benchmark: Total Findings (Final Report): 4041
2026-05-30T04:24:23.559243Z INFO benchmark: --- Performance Metrics (averages per reviewed patch) ---
2026-05-30T04:24:23.559261Z INFO benchmark: Avg Tokens In: 3868947
2026-05-30T04:24:23.559276Z INFO benchmark: Avg Tokens Out: 20746
2026-05-30T04:24:23.559294Z INFO benchmark: Avg Turns: 123.0
2026-05-30T04:24:23.559314Z INFO benchmark: Avg Time: 779s
2026-05-30T04:24:23.559328Z INFO benchmark: Detailed results written to benchmark_results.json

--

I wonder if a better approach is to have more targeted changes to the severity.md prompt? I like your prompt, but maybe it's a bit too narrowly focused on the kvm reality, so we should be a bit more careful. I like the "speculative" part though.

ftabba · 2026-06-18T15:45:18Z

Thanks again for running the full benchmark and your feedback. I've reworked the rubric to address the two points you raised.

On your two points, both of which were right:

Too KVM-specific and forced-choice. The reachability axis is now subsystem-neutral and explicitly illustrative ("examples, not a fixed set"), with an explicit escape: if reachability can't be determined from the diff, say so and don't lower the level. Undetermined reachability no longer caps the level, only reachability you've actually ruled out does. That removes the pick-from-a-bad-set case where the model hallucinates.
More targeted, in severity.md. I moved the calibration into severity.md as you suggested and cut the Stage 10 prompt to a short pointer. The speculative bar you liked stays: with no concrete triggering path a finding is capped at Medium and marked speculative, but still reported, not dropped for the missing path.

Subset numbers (99-patch subset of the benchmark, same model). Two baseline runs with no calibration gave 48 and 54 exact detections; the updated rubric gives 52, so recall holds, and it emits the fewest findings of the set (359 vs about 368 at baseline) at equal recall. I couldn't run the full 999 myself: it's very token-heavy and even the subset barely fit in my quota.

Could you run the full benchmark on the updated branch? And if you can share which commits regressed in your run, I can test the updated rubric against those directly rather than a random subset.

rgushchin · 2026-06-18T18:30:18Z

+  not let you determine reachability, say so and do not lower the level for it.
+
+Apply a cap only in the case it names:
+- Cosmetic (style, naming, or a commit-message-only problem): at most Low.


This is questionable actually: a decent commit message is a must for a patch being merged. In general I want low issues to be "nice to fix, not really a reason to not merge the change as it is". Also "at most" might suggest it should be dropped.

Agreed. Removing this line — the existing Low/Medium definitions already handle style, and folding a commit-message problem into Low is wrong since a weak message can block a merge. And taking your point that "at most Low" reads as "drop": nothing in the rubric should drop a real finding, only rank it.

rgushchin · 2026-06-18T18:32:44Z

+
+Apply a cap only in the case it names:
+- Cosmetic (style, naming, or a commit-message-only problem): at most Low.
+- Code you have shown cannot be reached: at most Low, however damaging it


Also a bit dangerous, IMO. I think we should start with causing LLM to determine reachability and see how it works in practice. My concern is that we might pessimise a lot of real bad issues when LLM is just not smart enough to understand the reachability. So maybe let's just bump issues for being externally/remotely reachable for start?

The key one, and I think you're right. Dropping the unreachable down-rank entirely. Reachability becomes an up-signal only: raise the level for a bug reachable by untrusted/remote/unprivileged input, and if reachability can't be established, leave it on consequence rather than lowering. That removes the "LLM guesses unreachable and buries a real bug" failure you're worried about, and matches "bump for externally/remotely reachable for a start."

rgushchin · 2026-06-18T18:34:09Z

+- Speculative (no concrete triggering path): at most Medium.
+- Critical or High requires a damaging consequence and a concrete,
+  non-speculative triggering path, on code you have not shown to be
+  unreachable. Unknown reachability does not by itself cap the level; only


I think this sentence is somewhat contradictory.

Agreed, it was a double-negative tangle. It disappears with the reachability change — once reachability only raises the level, there's no "unknown vs ruled-out" caveat left to state.

rgushchin · 2026-06-18T19:31:48Z

+- Triggering path: lay out the concrete path that reaches the bug, naming the
+  preconditions a caller or input must satisfy. If you cannot, because it rests
+  on an unproven assumption or on an ABI, register, or convention you might be
+  misreading, still report the finding and mark it speculative. A speculative


"A speculative finding is at most Medium" contradicts the next paragraph. Also, please add a "catch all other options" option for the consequence list, e.g. other.

The speculative cap was sitting right next to "do not lower / leave on consequence" and reading as a contradiction. I've scoped the never-lower rule to reachability and pulled the cap onto its own line as the single deliberate down-rank: it's the one place a level is capped (at Medium), because there the open question is whether the bug is real at all, not how severe it is. The finding is still always reported, never dropped, so this only moves a label, it can't drop anything from the report.

Added "other" as a catch-all to the consequence list too.

The Stage 10 verification prompt assigns a severity in one step, which over-states it: a finding gets Critical for a bad-sounding consequence even when the trigger is only assumed. Add a short calibration guide to severity.md and reduce the Stage 10 item to a pointer at it. The reviewer reasons through consequence, the concrete triggering path, and reachability: - a finding with no concrete triggering path is at most Medium and is marked speculative, but is always reported, never dropped, - reachability only raises the level: bump a bug reachable by untrusted, remote, or unprivileged input. Do not lower a finding for being or looking unreachable, since reachability is hard to establish from a diff and a wrong call buries a real bug. The only lever that lowers a level is an unproven triggering path (the speculative cap). Reachability and consequence can only raise it. Pure prompt change. The four labels, output schema, and report format are unchanged. Signed-off-by: Fuad Tabba <tabba@google.com>

ftabba mentioned this pull request Jun 14, 2026

review: anonymize author identity, calibrate severity, surface subprocess errors #232

Closed

ftabba force-pushed the tabba/severity_calibration branch from ae515d7 to 35e5338 Compare June 18, 2026 15:44

rgushchin reviewed Jun 18, 2026

View reviewed changes

ftabba force-pushed the tabba/severity_calibration branch from 35e5338 to 4596740 Compare June 18, 2026 19:19

rgushchin reviewed Jun 18, 2026

View reviewed changes

ftabba force-pushed the tabba/severity_calibration branch from 4596740 to dab2c3a Compare June 19, 2026 07:30

rgushchin merged commit f9ac906 into sashiko-dev:main Jun 22, 2026
3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

review: calibrate severity through reachability, consequence, and evidence#276

review: calibrate severity through reachability, consequence, and evidence#276
rgushchin merged 1 commit into
sashiko-dev:mainfrom
ftabba:tabba/severity_calibration

ftabba commented Jun 14, 2026 •

edited

Loading

Uh oh!

rgushchin commented Jun 15, 2026

Uh oh!

ftabba commented Jun 15, 2026

Uh oh!

rgushchin commented Jun 16, 2026

Uh oh!

ftabba commented Jun 18, 2026

Uh oh!

rgushchin Jun 18, 2026

Uh oh!

ftabba Jun 18, 2026

Uh oh!

rgushchin Jun 18, 2026

Uh oh!

ftabba Jun 18, 2026

Uh oh!

rgushchin Jun 18, 2026

Uh oh!

ftabba Jun 18, 2026

Uh oh!

rgushchin Jun 18, 2026

Uh oh!

ftabba Jun 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ftabba commented Jun 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rgushchin commented Jun 15, 2026

Uh oh!

ftabba commented Jun 15, 2026

Uh oh!

rgushchin commented Jun 16, 2026

Uh oh!

ftabba commented Jun 18, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ftabba commented Jun 14, 2026 •

edited

Loading