review: calibrate severity through reachability, consequence, and evidence#276
Conversation
|
So tbh I've uneasy feelings about this change. I like the direction, but I'm also afraid that we might start missing critical bugs because sashiko is not smart enough to fully understand the reachability path. Let me run my full benchmark and see how it changes things. |
|
Sure. If you'd like me to test or run anything as well I'm happy to help. I've been interacting a lot with Sashiko lately, both on-list and off-list. |
|
Unfortunately I see some regression in the full benchmark: vs older -- I wonder if a better approach is to have more targeted changes to the severity.md prompt? I like your prompt, but maybe it's a bit too narrowly focused on the kvm reality, so we should be a bit more careful. I like the "speculative" part though. |
ae515d7 to
35e5338
Compare
|
Thanks again for running the full benchmark and your feedback. I've reworked the rubric to address the two points you raised. On your two points, both of which were right:
Subset numbers (99-patch subset of the benchmark, same model). Two baseline runs with no calibration gave 48 and 54 exact detections; the updated rubric gives 52, so recall holds, and it emits the fewest findings of the set (359 vs about 368 at baseline) at equal recall. I couldn't run the full 999 myself: it's very token-heavy and even the subset barely fit in my quota. Could you run the full benchmark on the updated branch? And if you can share which commits regressed in your run, I can test the updated rubric against those directly rather than a random subset. |
| not let you determine reachability, say so and do not lower the level for it. | ||
|
|
||
| Apply a cap only in the case it names: | ||
| - Cosmetic (style, naming, or a commit-message-only problem): at most Low. |
There was a problem hiding this comment.
This is questionable actually: a decent commit message is a must for a patch being merged. In general I want low issues to be "nice to fix, not really a reason to not merge the change as it is". Also "at most" might suggest it should be dropped.
There was a problem hiding this comment.
Agreed. Removing this line — the existing Low/Medium definitions already handle style, and folding a commit-message problem into Low is wrong since a weak message can block a merge. And taking your point that "at most Low" reads as "drop": nothing in the rubric should drop a real finding, only rank it.
|
|
||
| Apply a cap only in the case it names: | ||
| - Cosmetic (style, naming, or a commit-message-only problem): at most Low. | ||
| - Code you have shown cannot be reached: at most Low, however damaging it |
There was a problem hiding this comment.
Also a bit dangerous, IMO. I think we should start with causing LLM to determine reachability and see how it works in practice. My concern is that we might pessimise a lot of real bad issues when LLM is just not smart enough to understand the reachability. So maybe let's just bump issues for being externally/remotely reachable for start?
There was a problem hiding this comment.
The key one, and I think you're right. Dropping the unreachable down-rank entirely. Reachability becomes an up-signal only: raise the level for a bug reachable by untrusted/remote/unprivileged input, and if reachability can't be established, leave it on consequence rather than lowering. That removes the "LLM guesses unreachable and buries a real bug" failure you're worried about, and matches "bump for externally/remotely reachable for a start."
| - Speculative (no concrete triggering path): at most Medium. | ||
| - Critical or High requires a damaging consequence and a concrete, | ||
| non-speculative triggering path, on code you have not shown to be | ||
| unreachable. Unknown reachability does not by itself cap the level; only |
There was a problem hiding this comment.
I think this sentence is somewhat contradictory.
There was a problem hiding this comment.
Agreed, it was a double-negative tangle. It disappears with the reachability change — once reachability only raises the level, there's no "unknown vs ruled-out" caveat left to state.
35e5338 to
4596740
Compare
| - Triggering path: lay out the concrete path that reaches the bug, naming the | ||
| preconditions a caller or input must satisfy. If you cannot, because it rests | ||
| on an unproven assumption or on an ABI, register, or convention you might be | ||
| misreading, still report the finding and mark it speculative. A speculative |
There was a problem hiding this comment.
"A speculative finding is at most Medium" contradicts the next paragraph. Also, please add a "catch all other options" option for the consequence list, e.g. other.
There was a problem hiding this comment.
The speculative cap was sitting right next to "do not lower / leave on consequence" and reading as a contradiction. I've scoped the never-lower rule to reachability and pulled the cap onto its own line as the single deliberate down-rank: it's the one place a level is capped (at Medium), because there the open question is whether the bug is real at all, not how severe it is. The finding is still always reported, never dropped, so this only moves a label, it can't drop anything from the report.
Added "other" as a catch-all to the consequence list too.
The Stage 10 verification prompt assigns a severity in one step, which
over-states it: a finding gets Critical for a bad-sounding consequence
even when the trigger is only assumed.
Add a short calibration guide to severity.md and reduce the Stage 10
item to a pointer at it. The reviewer reasons through consequence, the
concrete triggering path, and reachability:
- a finding with no concrete triggering path is at most Medium and is
marked speculative, but is always reported, never dropped,
- reachability only raises the level: bump a bug reachable by
untrusted, remote, or unprivileged input. Do not lower a finding for
being or looking unreachable, since reachability is hard to establish
from a diff and a wrong call buries a real bug.
The only lever that lowers a level is an unproven triggering path (the
speculative cap). Reachability and consequence can only raise it.
Pure prompt change. The four labels, output schema, and report format
are unchanged.
Signed-off-by: Fuad Tabba <tabba@google.com>
4596740 to
dab2c3a
Compare
Reworks the severity-calibration change in this PR per the review feedback. The earlier reachability/consequence/evidence form is dropped.
The Stage 10 verification prompt assigns severity in one step, which over-states it (Critical for a bad-sounding consequence even when the code is barely reachable or the trigger is only assumed). This adds a short calibration guide to
severity.mdand reduces the Stage 10 item to a pointer at it. The reviewer reasons through consequence, triggering path, and reachability, and lowers the level only on the grounds the guide names:Reachability is given as examples, not a fixed set to choose from, with an explicit "can't determine, so don't lower the level" escape (the forced-choice point you raised). Pure prompt change: the four labels, schema, and report format are unchanged.
Measurement (99-patch subset of the benchmark, same model). Two baseline runs with no calibration gave 48 and 54 exact detections; the updated rubric gives 52, so recall holds, and it emits the fewest findings of the set (359 vs about 368 at baseline) at equal recall. The full 999 is very token-heavy and out of reach here, so this is a subset.