[REVIEW] ai-data-privacy: add membership inference and privacy-attack evidence gates

## Skill Being Reviewed
**Skill name:** `ai-data-privacy`
**Skill path:** `skills/ai-security/ai-data-privacy/`

## False Positive Analysis

**Benign code/configuration that can be misclassified:**
```yaml
model_endpoint:
  task: internal ticket category classifier
  data_scope: synthetic and anonymized training records only
  output: top_label_only
  confidence_scores_returned: false
  nearest_training_example_debug: disabled
  query_controls:
    authentication: required
    per_user_rate_limit: 60/hour
    anomaly_detection: enabled
  privacy_evidence:
    pii_scan_on_training_data: passed
    deduplication: exact_and_near_duplicate
    dp_training: not_applicable_synthetic_data
    membership_inference_eval: not_applicable_synthetic_data_with_no_person_records
```

**Why this is a false positive:**

The current skill correctly treats memorization and PII disclosure as privacy risks, but it does not separate memorization/output PII risk from membership-inference risk. A reviewer could over-report every fine-tuned model that lacks memorization probes even when the model is trained on synthetic/anonymized data, exposes only a top label, disables debug nearest-neighbor outputs, and has query throttling. The skill needs an evidence path for `Not Applicable` or low-risk decisions when there is no per-person training membership to infer and no high-resolution model output exposed.

## Coverage Gaps

**Missed variant 1: confidence/logit exposure enables black-box membership inference**
```python
@app.post("/predict-risk")
def predict_risk(req: RiskRequest):
    probs = model.predict_proba(vectorize(req.text))[0]
    return {
        "label": labels[int(np.argmax(probs))],
        "confidence": float(np.max(probs)),
        "top_k": sorted(zip(labels, probs), key=lambda x: x[1], reverse=True)[:5],
    }
```
**Why it should be caught:**

For models trained on sensitive support tickets, health notes, fraud cases, or other person-linked records, returning calibrated probabilities, logits, entropy-like signals, or repeated top-k confidence values can expose training membership through black-box queries. The skill mentions memorization testing, but it does not require reviewers to inspect output granularity, query budgets, threshold behavior, or membership-inference evaluation evidence.
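One way a reviewer could produce the membership-inference evaluation evidence described above is a simple confidence-threshold test. The sketch below is illustrative only: it assumes the reviewer can obtain the endpoint's top confidence for a sample of known training records and a sample of held-out records, and the function name `membership_advantage` is hypothetical, not part of the skill.

```python
from typing import Sequence


def membership_advantage(member_conf: Sequence[float],
                         nonmember_conf: Sequence[float]) -> float:
    """Best (TPR - FPR) of a 'guess member if confidence >= t' attack.

    0.0 means this simple attack cannot separate the two samples;
    values near 1.0 mean confidence alone almost reveals membership.
    """
    if not member_conf or not nonmember_conf:
        raise ValueError("both samples must be non-empty")
    best = 0.0
    # Every observed confidence value is a candidate threshold.
    for t in sorted(set(member_conf) | set(nonmember_conf)):
        tpr = sum(c >= t for c in member_conf) / len(member_conf)
        fpr = sum(c >= t for c in nonmember_conf) / len(nonmember_conf)
        best = max(best, tpr - fpr)
    return best
```

A low value from this test is weak evidence on its own, since stronger attacks exist; it is a floor for the report, not a clearance.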

**Missed variant 2: embedding and RAG endpoints leak membership through similarity scores or document IDs**
```python
@app.post("/related-cases")
def related_cases(query: str):
    hits = vector_db.similarity_search_with_score(query, k=10)
    return [
        {"case_id": h.metadata["case_id"], "score": score, "chunk_id": h.metadata["chunk_id"]}
        for h, score in hits
    ]
```
**Why it should be caught:**

Even when raw PII is not returned, high-resolution similarity scores, stable document IDs, chunk IDs, or nearest-neighbor debug data can reveal whether a target record is present in the indexed corpus. The current vector-store guidance focuses on deletion and retention, but it does not require a privacy-attack review of exposed neighbor scores, identifiers, and query repetition controls.
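A hedged sketch of the kind of response minimization a reviewer might look for on such an endpoint: per-user opaque references instead of stable `case_id`/`chunk_id` values, and coarse relevance buckets instead of raw scores. The helper names, bucket boundaries, and the assumption that scores are similarities in [0, 1] with higher meaning closer are all illustrative; some vector stores return distances instead.

```python
import hashlib
import hmac
from typing import Iterable, List, Tuple


def opaque_id(case_id: str, user_id: str, key: bytes) -> str:
    """Keyed, per-user reference so IDs cannot be correlated across users."""
    msg = f"{user_id}:{case_id}".encode()
    return hmac.new(key, msg, hashlib.sha256).hexdigest()[:16]


def bucket_score(score: float) -> str:
    """Collapse a similarity score into a coarse bucket (illustrative cut-offs)."""
    if score >= 0.8:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"


def minimize_hits(hits: Iterable[Tuple[str, float]],
                  user_id: str, key: bytes) -> List[dict]:
    """Turn (case_id, score) pairs into a response without raw IDs or scores."""
    return [
        {"ref": opaque_id(case_id, user_id, key), "relevance": bucket_score(score)}
        for case_id, score in hits
    ]
```

Bucketing and opaque references reduce signal per query but do not replace access checks or query budgets, since repeated adaptive queries can still probe bucket boundaries.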

**Missed variant 3: label-only or threshold APIs are treated as safe without privacy testing**
```python
@app.post("/eligibility")
def eligibility(req: EligibilityRequest):
    score = model(req.features)
    return {"eligible": bool(score > 0.72)}
```
**Why it should be caught:**

Removing confidence scores reduces exposure but does not automatically eliminate membership inference. Label-only attacks can still exploit decision-boundary behavior, repeated perturbation queries, and overfit models. The skill should not tell reviewers to rely on confidence masking alone; it should require residual-risk evidence, query throttling, and model overfitting/generalization checks for sensitive training sets.
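The overfitting/generalization check mentioned above can be as simple as comparing accuracy on training records against accuracy on held-out records. The following is a minimal sketch with hypothetical function names; it assumes the reviewer has predicted and true labels for both samples, and it does not prescribe what gap is acceptable.

```python
from typing import Sequence


def accuracy(preds: Sequence, labels: Sequence) -> float:
    """Fraction of predictions that match the true labels."""
    if len(preds) != len(labels) or len(labels) == 0:
        raise ValueError("preds and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def generalization_gap(train_preds: Sequence, train_labels: Sequence,
                       holdout_preds: Sequence, holdout_labels: Sequence) -> float:
    """Train accuracy minus held-out accuracy.

    A large positive gap suggests overfitting, which is the condition
    label-only membership inference tends to exploit.
    """
    return accuracy(train_preds, train_labels) - accuracy(holdout_preds, holdout_labels)
```

The gap is evidence for the report rather than a verdict: a small gap lowers concern about label-only attacks but does not rule them out for outlier records.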

## Edge Cases

- Membership inference may be out of scope for models trained only on synthetic or aggregate non-personal data. The report needs a documented `Not Applicable` path.
- A provider-hosted model may not expose logits, but application-level routing, retrieval scores, refusal thresholds, or per-record debug traces can still leak membership signals.
- Batch analytics endpoints can leak membership through small cohort counts or per-segment model metrics even when single-record prediction APIs are locked down.
- Confidence rounding is not a complete fix if the endpoint allows repeated adaptive queries or returns stable nearest-neighbor identifiers.
- Differential privacy claims should include the mechanism, epsilon/delta where applicable, clipping/noise configuration, and the training population scope; marketing claims alone should be `Not Evaluable`.
- Privacy evidence can be sensitive. The skill should allow hashed dataset IDs, run IDs, evaluation summaries, and reviewer attestations instead of requiring raw private records in the report.
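
For the batch-analytics case above, a cohort-size threshold is one control a reviewer might expect to see. This is a minimal sketch; the function name and the default `min_size=10` are assumptions for illustration, not values taken from the skill.

```python
from typing import Dict, Optional


def suppress_small_cohorts(counts: Dict[str, int],
                           min_size: int = 10) -> Dict[str, Optional[int]]:
    """Replace counts below min_size with None so small segments are not reported."""
    return {
        segment: (n if n >= min_size else None)
        for segment, n in counts.items()
    }
```

Suppression alone does not stop differencing across overlapping queries, so it belongs alongside the query controls listed in this review, not in place of them.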

## Remediation Quality

- [x] Fix resolves the vulnerability
- [x] Fix doesn't introduce new security issues
- [x] Fix doesn't break functionality
- **Issues found:** Add a membership-inference and privacy-attack evidence gate to `ai-data-privacy`. For each model or retrieval endpoint processing personal or sensitive records, require reviewers to capture:
  - training/index data type: personal, aggregate, synthetic, or anonymized;
  - exposed output granularity: label only, confidence, logits, top-k probabilities, similarity scores, document IDs, debug traces;
  - query budget controls: authentication, rate limits, perturbation/anomaly detection, bulk export limits;
  - privacy evaluation: membership-inference test, label-only residual-risk review, overfitting/generalization gap, or documented `Not Applicable` reason;
  - embedding/RAG leakage controls: score rounding/suppression, opaque IDs, access checks before retrieval, and no nearest-training-example debug output to untrusted users;
  - mitigation evidence: differential privacy, deduplication, output minimization, cohort-size thresholds, logging/alerting, and red-team review where appropriate.

Severity guidance should distinguish:

- Critical/High: sensitive personal training or indexed records plus exposed logits/confidence/similarity scores with no membership-inference evaluation or query controls.
- Medium: label-only endpoint on sensitive data with no residual-risk review, overfitting evidence, or adaptive-query controls.
- Low: documentation gaps where output minimization, privacy tests, and query controls are already in place.
- Not Applicable: synthetic/non-personal data with no per-person membership relationship, documented in the report.
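
The evidence fields and severity tiers above could be encoded roughly as follows. This is a sketch under stated assumptions: the field names are hypothetical, the "no membership-inference evaluation or query controls" condition is read conservatively as "either one is missing", and the fall-through tier is `Low` because the sketch does not model a clean no-finding outcome.

```python
from dataclasses import dataclass, field
from typing import Set

# Output fields treated as high-resolution signals (taken from the list above).
HIGH_RES_OUTPUTS = {"confidence", "logits", "top_k", "similarity_scores"}


@dataclass
class EndpointEvidence:
    per_person_membership: bool           # False for synthetic/non-personal data
    outputs: Set[str] = field(default_factory=set)
    mi_evaluated: bool = False            # membership-inference test on record
    query_controls: bool = False          # auth, rate limits, anomaly detection
    residual_risk_reviewed: bool = False  # label-only residual-risk review
    na_rationale: str = ""                # required for a Not Applicable outcome


def severity(e: EndpointEvidence) -> str:
    if not e.per_person_membership:
        # Not Applicable must be documented; otherwise the claim cannot be checked.
        return "Not Applicable" if e.na_rationale else "Not Evaluable"
    if e.outputs & HIGH_RES_OUTPUTS and not (e.mi_evaluated and e.query_controls):
        return "Critical/High"
    if not e.outputs & HIGH_RES_OUTPUTS and not (
        e.residual_risk_reviewed and e.query_controls
    ):
        return "Medium"
    return "Low"
```

For example, `severity(EndpointEvidence(per_person_membership=True, outputs={"label", "confidence"}))` falls into the Critical/High tier because no evaluation or query controls are recorded.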

## Comparison to Other Tools

| Tool / Framework | Catches this? | Notes |
|------|:---:|-------|
| Semgrep | Partial | Can flag probability/logit/similarity-score fields in responses, but cannot prove membership-inference risk without model/data context. |
| CodeQL | Partial | Can trace sensitive identifiers or score fields to API responses, but privacy-attack adequacy needs domain evidence. |
| OWASP LLM02:2025 | Partial | Covers sensitive information disclosure and maps to training-data membership inference, but this skill needs concrete review gates for model and embedding APIs. |
| NIST AI RMF 1.0 | Partial | Recognizes AI privacy risks including inference attacks, but teams need specific operational evidence fields. |
| Manual privacy red-team review | Yes | Can test confidence, label-only, and embedding-score exposure under realistic query budgets. |

## Overall Assessment

**Strengths:**

- Strong lifecycle coverage across training data, prompts/completions, retention, memorization, EU AI Act requirements, and consent.
- Practical grep guidance for prompt construction, PII filtering, retention controls, and memorization testing.
- Good warning that embeddings are not anonymous and should follow source-data retention and deletion controls.

**Needs improvement:**

- Add membership inference and model privacy attacks as a first-class evidence gate, separate from memorization.
- Require output-granularity review for logits, confidence values, top-k probabilities, similarity scores, document IDs, and debug nearest-neighbor traces.
- Add label-only residual-risk handling so confidence masking is not treated as a complete defense.
- Add `Not Applicable` / `Not Evaluable` outcomes so reviewers do not over-report synthetic-data deployments or under-report untestable provider claims.

**Priority recommendations:**
1. Add a `Membership Inference & Privacy Attack Surface` subsection after memorization risk assessment.
2. Add report fields for output granularity, query budget controls, MI evaluation status, embedding-score exposure, mitigation evidence, and `Not Applicable` rationale.
3. Add severity rules for sensitive data plus confidence/logit/similarity exposure with no MI evidence.
4. Add search hints for `predict_proba`, `logits`, `confidence`, `top_k`, `similarity_search_with_score`, `score_threshold`, `nearest`, `embedding`, `epsilon`, `delta`, `differential_privacy`, and `rate_limit`.
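
The search hints in item 4 could be applied with a small scanner like the one below. It is a sketch, not the skill's tooling: the function names are hypothetical, matching is plain substring matching (so `embedding` also matches `embeddings`), and hits are leads for manual review rather than findings.

```python
import re
from pathlib import Path
from typing import Iterator, List, Tuple

HINTS = [
    "predict_proba", "logits", "confidence", "top_k",
    "similarity_search_with_score", "score_threshold", "nearest",
    "embedding", "epsilon", "delta", "differential_privacy", "rate_limit",
]
HINT_RE = re.compile("|".join(re.escape(h) for h in HINTS))


def scan_text(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, hint) pairs for every hint found in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for hint in sorted({m.group(0) for m in HINT_RE.finditer(line)}):
            hits.append((lineno, hint))
    return hits


def scan_tree(root: str, suffix: str = ".py") -> Iterator[Tuple[str, int, str]]:
    """Yield (path, line_number, hint) for matching files under root."""
    for path in sorted(Path(root).rglob(f"*{suffix}")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, hint in scan_text(text):
            yield str(path), lineno, hint
```

Terms such as `delta` and `nearest` are noisy, so the output is best triaged by endpoint handler first.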

**Sources checked:**
- OWASP LLM02:2025 Sensitive Information Disclosure: https://genai.owasp.org/llmrisk/llm022025-sensitive-information-disclosure/
- NIST AI RMF 1.0: https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf
- Shokri et al., "Membership Inference Attacks Against Machine Learning Models": https://arxiv.org/abs/1610.05820
- Choquette-Choo et al., "Label-Only Membership Inference Attacks": https://arxiv.org/abs/2007.14321

## Bounty Info
- [x] I have read and agree to the [CONTRIBUTING.md](../../CONTRIBUTING.md) bounty terms
- **Preferred payment method:** Payment details can be provided privately after maintainer acceptance.

