enhance: add per-category score to MultiRiskGraniteGuardianTool via logprobs #418
Merged
### What I changed
- Added `score_threshold` config field to `MultiRiskGraniteGuardianToolConfig` to optionally drop low-confidence detections and reduce false positives. Defaults to `0.0` (no filter) to preserve current behavior.
- Each entry in `risk_results` now includes a per-category `score` in [0, 1], derived from the logprob of the category's first emitted token in Step 2.
- Categories whose per-category score is below `score_threshold` are dropped from `detected_risks` and `risk_results`. Categories without a score (e.g., when Ollama does not return logprobs) pass through unfiltered as a graceful fallback.
- Step 1 decision remains label-based (`Yes`/`No`) as per the model card; no extra Ollama calls are added.
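A minimal sketch of the filtering rule described above, assuming scores arrive as a `{category: score}` mapping with `None` for categories that have no logprob. The function name `filter_by_threshold` and the plain-string category keys are illustrative, not the tool's actual API.

```python
from typing import Optional


def filter_by_threshold(
    category_scores: dict[str, Optional[float]],
    score_threshold: float = 0.0,
) -> dict[str, dict]:
    """Drop categories scored below the threshold; keep unscored ones."""
    risk_results = {}
    for category, score in category_scores.items():
        # No score (e.g. logprobs unavailable): pass through unfiltered.
        if score is not None and score < score_threshold:
            continue
        risk_results[category] = {"is_risky": True, "score": score}
    return risk_results


# Illustrative input: two scored categories and one without a score.
scores = {"Violence": 0.97, "Harmful": 0.35, "Unethical Behavior": None}
kept = filter_by_threshold(scores, score_threshold=0.5)
print(list(kept))  # ['Violence', 'Unethical Behavior']
```

With the default `score_threshold=0.0`, no score in [0, 1] falls below the threshold, so every category is kept, which matches the "no filter" default.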
### Why
The multi-harm model's self-reported text confidence (e.g., `"High"`, `"Not Harmful"`) is effectively binary in practice and useless for thresholding. False positives on some categories (e.g., `Harmful` flagged on sarcasm) could not be filtered without manual category exclusion. Per-category logprob-derived scores give callers a real numeric signal to threshold on (e.g., `Violence=0.97` vs `Harmful=0.35` for the same input).
### How I made the changes
- `akd/guardrails/providers/granite_guardian.py`:
  - `MultiRiskGraniteGuardianToolConfig`: added `score_threshold: float = 0.0` with `ge=0.0, le=1.0` validation.
  - `_call_category_detection`: added top-level `logprobs: True` and `top_logprobs: 5` to the Ollama `/api/generate` request body (Ollama accepts these as top-level params, not inside `options`).
  - `_parse_categories_with_scores`: new helper that parses comma-separated categories and computes per-category scores as `exp(first_token_logprob)`.
  - `_first_token_logprob_per_category`: new static helper that walks the token stream, skipping whitespace/commas, and returns the logprob of the first token of each emitted category.
  - `_parse_categories`: kept as a thin wrapper over `_parse_categories_with_scores` for backward compatibility.
  - `_arun`: applies `score_threshold` as a filter in the Step 2 detected-categories list comprehension; builds `risk_results` with `{"is_risky": True, "score": <float|None>}` per category.
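For illustration, the Step 2 request body might be assembled as below. Only the top-level placement of `logprobs` and `top_logprobs` comes from the change description; the helper name `build_category_request`, the model tag, and the `stream` and `options` values are assumptions.

```python
import json


def build_category_request(model: str, prompt: str) -> dict:
    """Build an Ollama /api/generate body that asks for token logprobs."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # assumption: a single non-streamed response
        # Per the change description, these sit at the top level,
        # not inside "options".
        "logprobs": True,
        "top_logprobs": 5,
        "options": {"temperature": 0},  # assumption: deterministic decoding
    }


# Placeholder model tag and prompt, for illustration only.
body = build_category_request("multi-harm-model", "<step 2 prompt>")
print(json.dumps(body, indent=2))
```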
### How to test
- `uv run pytest tests/guardrails/` — existing tests still pass (8 failures in `test_granite_think.py` are pre-existing and require a live `granite3.3-guardian:8b` model).
- `uv run python scripts/test_multi_harm.py` with live Ollama + `granite-guardian-3.2-5b-multi-harm-GGUF` — verifies per-category scores appear in `risk_results` (e.g., Violence=0.97, Unethical Behavior=0.76 for the same violent input; Harmful=0.35 on sarcasm, correctly flagged as low-confidence).
NISH1001
commented
Apr 13, 2026
❌ Tests failed (exit code: 1) 📊 Test Results
Branch: 📋 Full coverage report and logs are available in the workflow run.
### What
- Merged `_parse_categories` and `_parse_categories_with_scores` into a single `_parse_categories` method that returns `dict[GraniteHarmCategory, float | None]` (category -> per-category score).
- Removed the redundant `"scores"` key from the Step 2 return dict; `"categories"` now holds both the categories and their scores as a dict mapping.
- `unfiltered_categories` in `extra` now preserves per-category scores alongside the category list (previously was a bare list).
### Why
The previous split had `_parse_categories` as a thin list-returning wrapper over `_parse_categories_with_scores` purely for backward compatibility, but `_parse_categories` was only called internally and had no external consumers — dead weight. A single dict return is also a more natural fit: `risk_results` is already a dict of category -> metadata, and downstream consumers need both iteration and score lookup. Dicts preserve insertion order in Python 3.7+, so the model's emission order is kept.
### How
- `akd/guardrails/providers/granite_guardian.py`:
  - `_parse_categories`: now takes optional `token_logprobs`, returns `dict[GraniteHarmCategory, float | None]`. When `token_logprobs` is None/empty, scores are `None` (same behavior as the pre-logprobs version, just wrapped in dict keys instead of a list).
  - `_call_category_detection`: returns `{"categories": <dict>, "raw_response": ...}` (removed the separate `"scores"` key).
  - `_arun`: renamed local from `per_category_scores` to `category_scores`; iterates `category_scores.items()` directly in the filter comprehension; passes the whole `category_scores` dict as `extra["unfiltered_categories"]` to preserve scores in observability output.
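The merged parser's contract could look roughly like this. It is a simplified sketch: it uses plain strings instead of `GraniteHarmCategory`, takes the first-token logprobs as an already-extracted list, and the name `parse_categories` is hypothetical.

```python
import math
from typing import Optional, Sequence


def parse_categories(
    raw: str,
    first_token_logprobs: Optional[Sequence[float]] = None,
) -> dict[str, Optional[float]]:
    """Map each emitted category to exp(logprob), or None without logprobs."""
    names = [part.strip() for part in raw.split(",") if part.strip()]
    result: dict[str, Optional[float]] = {}
    for index, name in enumerate(names):
        if name in result:
            continue  # keep the first occurrence of a repeated category
        score = None
        if first_token_logprobs and index < len(first_token_logprobs):
            score = math.exp(first_token_logprobs[index])
        result[name] = score
    return result


# Without logprobs every score is None; emission order is preserved.
print(parse_categories("Violence, Harmful"))
# With logprobs each category gets a score in (0, 1].
print(parse_categories("Violence, Harmful", [-0.02, -1.20]))
```

Because the result is a single dict, a caller can iterate it in emission order and look up a score by category without keeping a parallel list.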
### How to test
- `uv run pytest tests/guardrails/ --ignore=tests/guardrails/test_granite_think.py` — all 21 tests pass.
- `uv run python scripts/test_multi_harm.py` with live Ollama — confirms `risk_results` still has `score` per category and `extra["unfiltered_categories"]` now includes scores.
❌ Tests failed (exit code: 1) 📊 Test Results
Branch: 📋 Full coverage report and logs are available in the workflow run.
Collaborator
muthukumaranR
left a comment
I think we need to figure out whether there is a MultiHarm fine-tuning pipeline so we are not tied to 3.2. It's already outdated by a few versions.
cc: @jbrry
Collaborator
Author
True. We should probably figure that out soon with LoRA.
muthukumaranR
approved these changes
Apr 14, 2026
### Scoring methodology (first-token logprob)
We use the first token's logprob as the per-category confidence score. Here's how it works with an example:
- Step 2 model output: `Violence, Unethical Behavior, Harmful`
- Ollama returns per-token logprobs for the generated text.
- We walk the token stream, skip commas/whitespace, and grab the first substantive token per category.
- Score = `exp(first_token_logprob)`:
  - `Violence`: `exp(-0.02)` = 0.98 (1 token, first token = full category)
  - `Unethical Behavior`: `exp(-0.70)` = 0.50 (first token = `Un`)
  - `Harmful`: `exp(-1.20)` = 0.30 (first token = `Harm`)

#### Why first-token only (not joint probability across all tokens)?
- Once the model emits `Un`, it has already committed to `Unethical Behavior`. Subsequent tokens (`eth`, `ical`, `Behavior`) are near-deterministic completions (logprobs ~0) that reflect spelling ability, not category confidence.
- Under a joint probability, `Violence` (1 token) would always outscore `Unethical Behavior` (4 tokens) even if the model is equally confident in both.

### How
- `akd/guardrails/providers/granite_guardian.py`:
  - Added `score_threshold` config (0.0–1.0, default 0.0) to optionally filter low-confidence categories.
  - The Step 2 Ollama request sets `logprobs: True` and `top_logprobs: 5`.
  - `_parse_categories` returns a `{category: score}` dict; score filtering and `risk_results` building happen in `_arun`.
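The token walk from the scoring methodology can be sketched as follows. The token stream shape (a list of `{"token", "logprob"}` dicts) is an assumption about the response format. The first-token logprobs are the ones from the worked example; the remaining tokens and their logprobs, including the `ful` split, are invented for illustration. The function name is hypothetical, and the sketch assumes commas arrive as their own tokens.

```python
import math


def first_token_logprobs(tokens: list[dict]) -> list[float]:
    """Return the logprob of the first substantive token of each category."""
    firsts = []
    at_category_start = True
    for entry in tokens:
        text = entry["token"].strip()
        if not text:
            continue  # pure whitespace carries no category signal
        if text == ",":
            at_category_start = True  # next substantive token opens a category
            continue
        if at_category_start:
            firsts.append(entry["logprob"])
            at_category_start = False
    return firsts


# Hypothetical stream for "Violence, Unethical Behavior, Harmful".
stream = [
    {"token": "Violence", "logprob": -0.02},
    {"token": ",", "logprob": -0.01},
    {"token": " Un", "logprob": -0.70},
    {"token": "eth", "logprob": -0.001},
    {"token": "ical", "logprob": -0.001},
    {"token": " Behavior", "logprob": -0.002},
    {"token": ",", "logprob": -0.01},
    {"token": " Harm", "logprob": -1.20},
    {"token": "ful", "logprob": -0.001},
]

scores = [round(math.exp(lp), 2) for lp in first_token_logprobs(stream)]
print(scores)  # [0.98, 0.5, 0.3]
```

Continuation tokens such as `eth` or ` Behavior` are skipped because the walk only records a logprob immediately after a separator, so a category's score does not depend on how many tokens its name spans.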