False Positive Analysis
Benign code/configuration that can be misclassified:
model_endpoint:
  task: internal ticket category classifier
  data_scope: synthetic and anonymized training records only
  output: top_label_only
  confidence_scores_returned: false
  nearest_training_example_debug: disabled
query_controls:
  authentication: required
  per_user_rate_limit: 60/hour
  anomaly_detection: enabled
privacy_evidence:
  pii_scan_on_training_data: passed
  deduplication: exact_and_near_duplicate
  dp_training: not_applicable_synthetic_data
  membership_inference_eval: not_applicable_synthetic_data_with_no_person_records
Why this is a false positive:
The current skill correctly treats memorization and PII disclosure as privacy risks, but it does not separate memorization/output PII risk from membership-inference risk. A reviewer could over-report every fine-tuned model that lacks memorization probes even when the model is trained on synthetic/anonymized data, exposes only a top label, disables debug nearest-neighbor outputs, and has query throttling. The skill needs an evidence path for Not Applicable or low-risk decisions when there is no per-person training membership to infer and no high-resolution model output exposed.
Coverage Gaps
Missed variant 1: confidence/logit exposure enables black-box membership inference
Why it should be caught:
For models trained on sensitive support tickets, health notes, fraud cases, or other person-linked records, returning calibrated probabilities, logits, entropy-like signals, or repeated top-k confidence values can expose training membership through black-box queries. The skill mentions memorization testing, but it does not require reviewers to inspect output granularity, query budgets, threshold behavior, or membership-inference evaluation evidence.
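As an illustration of what membership-inference evaluation evidence could look like, the following is a minimal sketch of a confidence-threshold test. It assumes the reviewer can obtain the model's top confidence for a sample of known training records and a sample of held-out records from the same distribution. The function name and the sample values are hypothetical and not part of the skill.

```python
from typing import Sequence


def threshold_attack_advantage(
    member_conf: Sequence[float],
    nonmember_conf: Sequence[float],
) -> tuple[float, float]:
    """Return (best_threshold, advantage) for a confidence-threshold attack.

    The attacker guesses "member" when the top confidence is at or above the
    threshold. Advantage is TPR - FPR: 0.0 means no better than chance,
    1.0 means members and non-members are perfectly separable.
    """
    if not member_conf or not nonmember_conf:
        raise ValueError("both samples must be non-empty")

    best_threshold, best_advantage = 1.0, 0.0
    # Every observed confidence is a candidate decision threshold.
    for threshold in sorted(set(member_conf) | set(nonmember_conf)):
        tpr = sum(c >= threshold for c in member_conf) / len(member_conf)
        fpr = sum(c >= threshold for c in nonmember_conf) / len(nonmember_conf)
        if tpr - fpr > best_advantage:
            best_threshold, best_advantage = threshold, tpr - fpr
    return best_threshold, best_advantage


# Hypothetical confidences: an overfit model tends to be more confident on
# records it was trained on than on held-out records.
members = [0.99, 0.98, 0.97, 0.95, 0.70]
holdout = [0.93, 0.80, 0.72, 0.66, 0.60]
threshold, advantage = threshold_attack_advantage(members, holdout)
```

An advantage well above zero on sensitive data suggests that the exposed confidence values carry a usable membership signal. What counts as acceptable is a policy decision the sketch does not make.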
Missed variant 2: embedding and RAG endpoints leak membership through similarity scores or document IDs
Why it should be caught:
Even when raw PII is not returned, high-resolution similarity scores, stable document IDs, chunk IDs, or nearest-neighbor debug data can reveal whether a target record is present in the indexed corpus. The current vector-store guidance focuses on deletion and retention, but it does not require a privacy-attack review of exposed neighbor scores, identifiers, and query repetition controls.
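The exposure described here can be reduced at the response boundary. The sketch below assumes retrieval hits arrive as `(document_id, score, text)` tuples and shows one possible way to replace raw IDs with per-session opaque handles and to return coarse relevance bands instead of raw scores. The function names, the band cut-offs, and the `min_score` default are all hypothetical.

```python
import hashlib
import hmac


def opaque_id(document_id: str, session_id: str, secret: bytes) -> str:
    """Derive a per-session handle so raw document IDs never leave the service."""
    message = f"{session_id}:{document_id}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()[:16]


def minimize_hits(hits, session_id: str, secret: bytes, min_score: float = 0.5):
    """Drop weak matches, hide raw IDs, and bucket scores into coarse bands."""
    minimized = []
    for document_id, score, text in hits:
        if score < min_score:
            continue  # weak matches are not returned at all
        band = "high" if score >= 0.8 else "medium"  # illustrative cut-off
        minimized.append(
            {
                "ref": opaque_id(document_id, session_id, secret),
                "relevance": band,
                "text": text,
            }
        )
    return minimized
```

Keying the handle on a session identifier means the same document gets a different handle in each session, which limits correlation across repeated queries. Coarse bands reduce score resolution but do not remove the signal entirely, so query repetition controls are still needed.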
Missed variant 3: label-only or threshold APIs are treated as safe without privacy testing
Why it should be caught:
Removing confidence scores reduces exposure but does not automatically eliminate membership inference. Label-only attacks can still exploit decision-boundary behavior, repeated perturbation queries, and overfit models. The skill should not tell reviewers to rely on confidence masking alone; it should require residual-risk evidence, query throttling, and model overfitting/generalization checks for sensitive training sets.
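To make the label-only risk concrete, the sketch below measures how often a predicted label survives random perturbation of the input. Label-only attacks can use this kind of stability as a proxy for distance to the decision boundary. A reviewer could compare the stability of known training records against held-out records as residual-risk evidence. The `label_stability` function, the uniform noise model, and the toy classifier are assumptions made for illustration only.

```python
import random
from typing import Callable, Sequence


def label_stability(
    predict: Callable[[Sequence[float]], int],
    record: Sequence[float],
    noise_scale: float = 1.0,
    trials: int = 200,
    seed: int = 0,
) -> float:
    """Fraction of perturbed copies of `record` that keep its predicted label."""
    rng = random.Random(seed)
    original = predict(record)
    kept = 0
    for _ in range(trials):
        noisy = [v + rng.uniform(-noise_scale, noise_scale) for v in record]
        kept += predict(noisy) == original
    return kept / trials


def toy_predict(record: Sequence[float]) -> int:
    """Hypothetical label-only endpoint: returns a class label and nothing else."""
    return int(record[0] > 0.0)


far_from_boundary = label_stability(toy_predict, [5.0])
near_boundary = label_stability(toy_predict, [0.1])
```

If training records are consistently more stable than held-out records, the endpoint leaks a membership signal even though it returns no confidence values. Note that the probe issues many queries per record, which is why query throttling and adaptive-query detection matter for label-only endpoints.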
Edge Cases
Membership inference may be out of scope for models trained only on synthetic or aggregate non-personal data. The report needs a documented Not Applicable path.
A provider-hosted model may not expose logits, but application-level routing, retrieval scores, refusal thresholds, or per-record debug traces can still leak membership signals.
Batch analytics endpoints can leak membership through small cohort counts or per-segment model metrics even when single-record prediction APIs are locked down.
Confidence rounding is not a complete fix if the endpoint allows repeated adaptive queries or returns stable nearest-neighbor identifiers.
Differential privacy claims should include the mechanism, epsilon/delta where applicable, clipping/noise configuration, and the training population scope; marketing claims alone should be Not Evaluable.
Privacy evidence can be sensitive. The skill should allow hashed dataset IDs, run IDs, evaluation summaries, and reviewer attestations instead of requiring raw private records in the report.
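For the batch-analytics edge case, a cohort-size threshold is one mitigation a reviewer might look for. The following is a minimal sketch that suppresses per-segment counts below a minimum size. The function name and the default threshold of 10 are hypothetical, not values taken from the skill.

```python
from collections import Counter
from typing import Iterable, Optional


def cohort_counts(
    segments: Iterable[str],
    min_cohort_size: int = 10,
) -> dict[str, Optional[int]]:
    """Count records per segment, replacing small counts with None."""
    counts = Counter(segments)
    return {
        segment: (count if count >= min_cohort_size else None)
        for segment, count in counts.items()
    }


# Hypothetical usage: the 3-record segment is suppressed, the 12-record one is kept.
report = cohort_counts(["enterprise"] * 12 + ["pilot"] * 3)
```

Suppression of this kind does not by itself prevent differencing across overlapping queries, so it complements query controls instead of replacing them.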
Remediation Quality
Fix resolves the vulnerability
Fix doesn't introduce new security issues
Fix doesn't break functionality
Issues found: Add a membership-inference and privacy-attack evidence gate to ai-data-privacy. For each model or retrieval endpoint processing personal or sensitive records, require reviewers to capture:
training/index data type: personal, aggregate, synthetic, or anonymized;
privacy evaluation: membership-inference test, label-only residual-risk review, overfitting/generalization gap, or documented Not Applicable reason;
embedding/RAG leakage controls: score rounding/suppression, opaque IDs, access checks before retrieval, and no nearest-training-example debug output to untrusted users;
mitigation evidence: differential privacy, deduplication, output minimization, cohort-size thresholds, logging/alerting, and red-team review where appropriate.
Severity guidance should distinguish:
Critical/High: sensitive personal training or indexed records plus exposed logits/confidence/similarity scores with no membership-inference evaluation or query controls.
Medium: label-only endpoint on sensitive data with no residual-risk review, overfitting evidence, or adaptive-query controls.
Low: documentation gaps where output minimization, privacy tests, and query controls are already in place.
Not Applicable: synthetic/non-personal data with no per-person membership relationship, documented in the report.
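The severity guidance above could be encoded roughly as follows. This is a sketch under stated assumptions: the `EndpointEvidence` record and its field names are hypothetical, and combinations the guidance does not mention are returned as "Needs reviewer judgment" without a guessed rating.

```python
from dataclasses import dataclass


@dataclass
class EndpointEvidence:
    """Hypothetical evidence record; field names are illustrative only."""

    per_person_membership: bool           # False for synthetic/non-personal data
    exposes_scores: bool                  # logits, confidences, or similarity scores
    mi_evaluated: bool = False            # membership-inference evaluation on file
    residual_risk_reviewed: bool = False  # label-only residual-risk review on file
    overfitting_checked: bool = False     # generalization-gap evidence on file
    query_controls: bool = False          # rate limits or adaptive-query detection
    documentation_gaps: bool = False


def classify(e: EndpointEvidence) -> str:
    """Map an evidence record to the severity bands listed in the guidance."""
    if not e.per_person_membership:
        return "Not Applicable"
    if e.exposes_scores and not (e.mi_evaluated or e.query_controls):
        return "Critical/High"
    label_only_evidence = (
        e.residual_risk_reviewed or e.overfitting_checked or e.query_controls
    )
    if not e.exposes_scores and not label_only_evidence:
        return "Medium"
    controls_in_place = (
        not e.exposes_scores and e.mi_evaluated and e.query_controls
    )
    if controls_in_place and e.documentation_gaps:
        return "Low"
    # Assumption: the guidance does not cover the remaining combinations.
    return "Needs reviewer judgment"
```

For example, `classify(EndpointEvidence(per_person_membership=True, exposes_scores=True))` falls into the Critical/High band because neither an evaluation nor query controls are recorded.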
Comparison to Other Tools
| Tool / Framework | Catches this? | Notes |
| --- | --- | --- |
| Semgrep | Partial | Can flag probability/logit/similarity-score fields in responses, but cannot prove membership-inference risk without model/data context. |
| CodeQL | Partial | Can trace sensitive identifiers or score fields to API responses, but privacy-attack adequacy needs domain evidence. |
| OWASP LLM02:2025 | Partial | Covers sensitive information disclosure and maps to training-data membership inference, but this skill needs concrete review gates for model and embedding APIs. |
| NIST AI RMF 1.0 | Partial | Recognizes AI privacy risks including inference attacks, but teams need specific operational evidence fields. |
| Manual privacy red-team review | Yes | Can test confidence, label-only, and embedding-score exposure under realistic query budgets. |
Overall Assessment
Strengths:
Strong lifecycle coverage across training data, prompts/completions, retention, memorization, EU AI Act requirements, and consent.
Practical grep guidance for prompt construction, PII filtering, retention controls, and memorization testing.
Good warning that embeddings are not anonymous and should follow source-data retention and deletion controls.
Needs improvement:
Add membership inference and model privacy attacks as a first-class evidence gate, separate from memorization.
Require output-granularity review for logits, confidence values, top-k probabilities, similarity scores, document IDs, and debug nearest-neighbor traces.
Add label-only residual-risk handling so confidence masking is not treated as a complete defense.
Add Not Applicable / Not Evaluable outcomes so reviewers do not over-report synthetic-data deployments or under-report untestable provider claims.
Priority recommendations:
Add a Membership Inference & Privacy Attack Surface subsection after memorization risk assessment.
Add report fields for output granularity, query budget controls, MI evaluation status, embedding-score exposure, mitigation evidence, and Not Applicable rationale.
Add severity rules for sensitive data plus confidence/logit/similarity exposure with no MI evidence.
Add search hints for predict_proba, logits, confidence, top_k, similarity_search_with_score, score_threshold, nearest, embedding, epsilon, delta, differential_privacy, and rate_limit.
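The search hints in the last recommendation could be applied with a small scanner like the one below. It is a sketch, not part of the skill: the file suffixes and function names are assumptions, and matching is a case-insensitive substring search, so short hints such as `delta` will produce noisy results that need manual triage.

```python
import re
from pathlib import Path

# Hints copied from the recommendation above.
SEARCH_HINTS = [
    "predict_proba", "logits", "confidence", "top_k",
    "similarity_search_with_score", "score_threshold", "nearest",
    "embedding", "epsilon", "delta", "differential_privacy", "rate_limit",
]
PATTERN = re.compile("|".join(re.escape(h) for h in SEARCH_HINTS), re.IGNORECASE)


def scan_text(text: str) -> list[tuple[int, str]]:
    """Return (line_number, hint) pairs for every hint found in `text`."""
    found = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in PATTERN.finditer(line):
            found.append((line_number, match.group(0).lower()))
    return found


def scan_tree(root: str, suffixes=(".py", ".yaml", ".yml", ".json")):
    """Yield (path, line_number, hint) for matching files under `root`."""
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="ignore")
            for line_number, hint in scan_text(text):
                yield path, line_number, hint
```

A hit only marks a place to inspect. Whether a returned score or identifier is a membership-inference risk still depends on the data type and on who can query the endpoint.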
Skill Being Reviewed
Skill name: ai-data-privacy
Skill path: skills/ai-security/ai-data-privacy/
Sources checked:
Bounty Info