[REVIEW] ai-data-privacy: add membership inference and privacy-attack evidence gates #1213

@alejandrorivas-pixel

Skill Being Reviewed

Skill name: ai-data-privacy
Skill path: skills/ai-security/ai-data-privacy/

False Positive Analysis

Benign code/configuration that can be misclassified:

model_endpoint:
  task: internal ticket category classifier
  data_scope: synthetic and anonymized training records only
  output: top_label_only
  confidence_scores_returned: false
  nearest_training_example_debug: disabled
  query_controls:
    authentication: required
    per_user_rate_limit: 60/hour
    anomaly_detection: enabled
  privacy_evidence:
    pii_scan_on_training_data: passed
    deduplication: exact_and_near_duplicate
    dp_training: not_applicable_synthetic_data
    membership_inference_eval: not_applicable_synthetic_data_with_no_person_records

Why this is a false positive:

The current skill correctly treats memorization and PII disclosure as privacy risks, but it does not distinguish them from membership-inference risk. As a result, a reviewer could over-report every fine-tuned model that lacks memorization probes, even when the model is trained only on synthetic or anonymized data, exposes only a top label, disables nearest-neighbor debug output, and throttles queries. The skill needs an evidence path for Not Applicable or low-risk decisions when there is no per-person training membership to infer and no high-resolution model output is exposed.
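
To make the proposed evidence path concrete, here is a minimal sketch of how a reviewer's decision could be recorded. The `EndpointFacts` fields and the decision strings are hypothetical names chosen for illustration; they are not part of the current skill.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class EndpointFacts:
    """Facts a reviewer records for one endpoint (hypothetical field names)."""
    has_person_records: bool      # any per-person rows in training or index data
    returns_scores: bool          # confidence, logits, top-k, or similarity scores
    nearest_example_debug: bool   # nearest-training-example debug output enabled


def membership_inference_decision(facts: EndpointFacts) -> Tuple[str, str]:
    """Return (decision, rationale) for the membership-inference gate."""
    high_resolution = facts.returns_scores or facts.nearest_example_debug
    if not facts.has_person_records and not high_resolution:
        return ("not_applicable",
                "no per-person membership to infer and no high-resolution output")
    reasons = []
    if facts.has_person_records:
        reasons.append("person-linked records present")
    if facts.returns_scores:
        reasons.append("scores exposed")
    if facts.nearest_example_debug:
        reasons.append("nearest-example debug enabled")
    return ("evaluation_required", "; ".join(reasons))


# The ticket classifier described in the configuration above.
ticket_classifier = EndpointFacts(
    has_person_records=False, returns_scores=False, nearest_example_debug=False
)
decision, rationale = membership_inference_decision(ticket_classifier)
print(decision, "-", rationale)
```

The sketch is deliberately conservative: anything short of "no person records and no high-resolution output" falls through to a required evaluation rather than a silent pass.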

Coverage Gaps

Missed variant 1: confidence/logit exposure enables black-box membership inference

@app.post("/predict-risk")
def predict_risk(req: RiskRequest):
    probs = model.predict_proba(vectorize(req.text))[0]
    return {
        "label": labels[int(np.argmax(probs))],
        "confidence": float(np.max(probs)),
        "top_k": sorted(zip(labels, probs), key=lambda x: x[1], reverse=True)[:5],
    }

Why it should be caught:

For models trained on sensitive support tickets, health notes, fraud cases, or other person-linked records, returning calibrated probabilities, logits, entropy-like signals, or repeated top-k confidence values can expose training membership through black-box queries. The skill mentions memorization testing, but it does not require reviewers to inspect output granularity, query budgets, threshold behavior, or membership-inference evaluation evidence.
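
One way to produce membership-inference evaluation evidence is a simple confidence-threshold attack: guess "member" whenever the returned confidence is at or above a threshold, and report the best gap between true-positive and false-positive rates. The sketch below assumes the reviewer already has confidences for known training records and for held-out records from the same distribution; the numbers are made up for illustration.

```python
from typing import Sequence


def membership_advantage(member_conf: Sequence[float],
                         nonmember_conf: Sequence[float]) -> float:
    """Best (TPR - FPR) over all thresholds for the rule 'confidence >= t'."""
    best = 0.0
    for t in sorted(set(member_conf) | set(nonmember_conf)):
        tpr = sum(c >= t for c in member_conf) / len(member_conf)
        fpr = sum(c >= t for c in nonmember_conf) / len(nonmember_conf)
        best = max(best, tpr - fpr)
    return best


# Illustrative values only: confidences returned for known training records
# versus held-out records that the model never saw.
members = [0.99, 0.97, 0.95, 0.90]
nonmembers = [0.92, 0.80, 0.70, 0.60]
advantage = membership_advantage(members, nonmembers)
print(f"attack advantage: {advantage:.2f}")
```

An advantage near 0 means this attack cannot separate members from non-members; a value near 1 means the exposed confidence alone is enough to identify training records. Stronger attacks exist, so a low value here is evidence, not proof.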

Missed variant 2: embedding and RAG endpoints leak membership through similarity scores or document IDs

@app.post("/related-cases")
def related_cases(query: str):
    hits = vector_db.similarity_search_with_score(query, k=10)
    return [
        {"case_id": h.metadata["case_id"], "score": score, "chunk_id": h.metadata["chunk_id"]}
        for h, score in hits
    ]

Why it should be caught:

Even when raw PII is not returned, high-resolution similarity scores, stable document IDs, chunk IDs, or nearest-neighbor debug data can reveal whether a target record is present in the indexed corpus. The current vector-store guidance focuses on deletion and retention, but it does not require a privacy-attack review of exposed neighbor scores, identifiers, and query repetition controls.
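
As a sketch of what output minimization could look like for such an endpoint, the example below replaces stable case IDs with per-caller opaque references and collapses raw scores into coarse buckets. The function names, the bucket cutoffs, and the assumption that scores lie in [0, 1] with higher meaning more similar are all illustrative; some vector stores return distances instead.

```python
import hashlib
import hmac
from typing import Dict, List, Sequence, Tuple


def opaque_ref(case_id: str, caller_id: str, secret: bytes) -> str:
    """Per-caller reference, so IDs cannot be compared across callers."""
    message = f"{caller_id}:{case_id}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()[:16]


def bucket_score(score: float) -> str:
    """Collapse a similarity score into a coarse bucket (cutoffs are assumptions)."""
    if score >= 0.8:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"


def minimize_hits(hits: Sequence[Tuple[str, float]],
                  caller_id: str, secret: bytes) -> List[Dict[str, str]]:
    """Drop raw scores and chunk IDs; return opaque refs and coarse relevance."""
    return [
        {"ref": opaque_ref(case_id, caller_id, secret),
         "relevance": bucket_score(score)}
        for case_id, score in hits
    ]


SECRET = b"demo-only-secret"  # placeholder; load from a secret store in practice
hits = [("case-1042", 0.93), ("case-0007", 0.41)]
for_alice = minimize_hits(hits, "alice", SECRET)
for_bob = minimize_hits(hits, "bob", SECRET)
```

This narrows the signal but does not remove it: references remain stable for a given caller, so access checks before retrieval and limits on repeated queries are still needed.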

Missed variant 3: label-only or threshold APIs are treated as safe without privacy testing

@app.post("/eligibility")
def eligibility(req: EligibilityRequest):
    score = model(req.features)
    return {"eligible": bool(score > 0.72)}

Why it should be caught:

Removing confidence scores reduces exposure but does not automatically eliminate membership inference. Label-only attacks can still exploit decision-boundary behavior, repeated perturbation queries, and overfit models. The skill should not tell reviewers to rely on confidence masking alone; it should require residual-risk evidence, query throttling, and model overfitting/generalization checks for sensitive training sets.
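
To illustrate why a boolean response still carries signal, the sketch below measures how often the label survives small random perturbations of the input. Label-only attacks use this kind of stability as a proxy for distance to the decision boundary. The `toy_eligibility` model is a hypothetical stand-in that reuses the 0.72 cutoff from the snippet above; the noise level and trial count are arbitrary.

```python
import random
from typing import Callable, Sequence


def label_stability(predict: Callable[[Sequence[float]], bool],
                    x: Sequence[float],
                    noise: float = 0.05,
                    trials: int = 200,
                    seed: int = 0) -> float:
    """Fraction of noisy copies of x that keep the same label as x."""
    rng = random.Random(seed)
    original = predict(x)
    same = 0
    for _ in range(trials):
        perturbed = [v + rng.uniform(-noise, noise) for v in x]
        same += predict(perturbed) == original
    return same / trials


def toy_eligibility(features: Sequence[float]) -> bool:
    """Hypothetical stand-in model: mean feature value against the 0.72 cutoff."""
    return sum(features) / len(features) > 0.72


far_from_boundary = label_stability(toy_eligibility, [0.95, 0.95])
near_boundary = label_stability(toy_eligibility, [0.73, 0.73])
print(far_from_boundary, near_boundary)
```

A reviewer could compare stability for known training records against held-out records: if training records are consistently more stable, the label-only endpoint leaks membership. Each stability estimate here costs 200 queries, which is why query throttling matters.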

Edge Cases

  • Membership inference may be out of scope for models trained only on synthetic or aggregate non-personal data. The report needs a documented Not Applicable path.
  • A provider-hosted model may not expose logits, but application-level routing, retrieval scores, refusal thresholds, or per-record debug traces can still leak membership signals.
  • Batch analytics endpoints can leak membership through small cohort counts or per-segment model metrics even when single-record prediction APIs are locked down.
  • Confidence rounding is not a complete fix if the endpoint allows repeated adaptive queries or returns stable nearest-neighbor identifiers.
  • Differential privacy claims should include the mechanism, epsilon/delta where applicable, clipping/noise configuration, and the training population scope; marketing claims alone should be rated Not Evaluable.
  • Privacy evidence can be sensitive. The skill should allow hashed dataset IDs, run IDs, evaluation summaries, and reviewer attestations instead of requiring raw private records in the report.
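
The differential privacy edge case can be sketched as a completeness check on the claim itself. The field names below are hypothetical, `delta` is treated as optional because the text says "where applicable", and the example values are illustrative rather than recommended settings.

```python
from typing import Dict, List, Tuple

# Fields a claim must document before a reviewer can assess it (assumed names).
REQUIRED_DP_FIELDS = (
    "mechanism", "epsilon", "clipping_norm", "noise_multiplier", "population_scope",
)


def dp_claim_status(claim: Dict[str, object]) -> Tuple[str, List[str]]:
    """Return ('Evaluable', []) or ('Not Evaluable', [missing field names])."""
    missing = [name for name in REQUIRED_DP_FIELDS
               if claim.get(name) in (None, "")]
    if missing:
        return ("Not Evaluable", missing)
    return ("Evaluable", [])


marketing_claim = {"mechanism": "differential privacy"}
documented_claim = {
    "mechanism": "DP-SGD",
    "epsilon": 8.0,
    "delta": 1e-5,
    "clipping_norm": 1.0,
    "noise_multiplier": 1.1,
    "population_scope": "all fine-tuning records, run ID recorded separately",
}
print(dp_claim_status(marketing_claim))
print(dp_claim_status(documented_claim))
```

Passing this check only means the claim is specific enough to review; whether the stated parameters give adequate protection is a separate judgment.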

Remediation Quality

  • Fix resolves the vulnerability
  • Fix doesn't introduce new security issues
  • Fix doesn't break functionality
  • Issues found: Add a membership-inference and privacy-attack evidence gate to ai-data-privacy. For each model or retrieval endpoint processing personal or sensitive records, require reviewers to capture:
    • training/index data type: personal, aggregate, synthetic, or anonymized;
    • exposed output granularity: label only, confidence, logits, top-k probabilities, similarity scores, document IDs, debug traces;
    • query budget controls: authentication, rate limits, perturbation/anomaly detection, bulk export limits;
    • privacy evaluation: membership-inference test, label-only residual-risk review, overfitting/generalization gap, or documented Not Applicable reason;
    • embedding/RAG leakage controls: score rounding/suppression, opaque IDs, access checks before retrieval, and no nearest-training-example debug output to untrusted users;
    • mitigation evidence: differential privacy, deduplication, output minimization, cohort-size thresholds, logging/alerting, and red-team review where appropriate.
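
The six evidence fields above could be captured as a structured report entry. The following is a minimal sketch with a hypothetical schema; the class name, field names, and the choice of which fields count as required are assumptions, not an existing format in the skill.

```python
from dataclasses import dataclass, field
from typing import List, Optional

DATA_TYPES = {"personal", "aggregate", "synthetic", "anonymized"}


@dataclass
class PrivacyAttackEvidence:
    """One report entry per model or retrieval endpoint (hypothetical schema)."""
    endpoint: str
    data_type: str                      # one of DATA_TYPES
    output_granularity: List[str]       # e.g. ["label"], ["confidence", "top_k"]
    query_controls: List[str]           # e.g. ["authentication", "rate_limit"]
    privacy_evaluation: Optional[str] = None
    not_applicable_reason: Optional[str] = None
    leakage_controls: List[str] = field(default_factory=list)
    mitigation_evidence: List[str] = field(default_factory=list)

    def gaps(self) -> List[str]:
        """Names of required fields the reviewer has not filled in."""
        missing = []
        if self.data_type not in DATA_TYPES:
            missing.append("data_type")
        if not self.output_granularity:
            missing.append("output_granularity")
        if not self.query_controls:
            missing.append("query_controls")
        if not (self.privacy_evaluation or self.not_applicable_reason):
            missing.append("privacy_evaluation")
        return missing


# The /predict-risk endpoint from missed variant 1, assuming it was trained
# on person-linked records.
predict_risk = PrivacyAttackEvidence(
    endpoint="/predict-risk",
    data_type="personal",
    output_granularity=["label", "confidence", "top_k"],
    query_controls=[],
)
print(predict_risk.gaps())
```

Free-text values such as run IDs or hashed dataset IDs fit in `mitigation_evidence`, so the entry never needs to contain raw private records.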

Severity guidance should distinguish:

  • Critical/High: sensitive personal training or indexed records plus exposed logits/confidence/similarity scores with no membership-inference evaluation or query controls.
  • Medium: label-only endpoint on sensitive data with no residual-risk review, overfitting evidence, or adaptive-query controls.
  • Low: documentation gaps where output minimization, privacy tests, and query controls are already in place.
  • Not Applicable: synthetic/non-personal data with no per-person membership relationship, documented in the report.
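
The severity tiers can be expressed as a small decision function. This is a sketch under stated assumptions: `privacy_tested` stands for a membership-inference test or a label-only residual-risk review with overfitting evidence, and combinations that the guidance above does not cover are returned as "Reviewer judgment" rather than guessed.

```python
def membership_inference_severity(*, person_linked: bool,
                                  high_resolution_output: bool,
                                  privacy_tested: bool,
                                  query_controls: bool) -> str:
    """Map reviewer evidence to a severity tier (parameter names are assumptions)."""
    if not person_linked:
        return "Not Applicable"
    if privacy_tested and query_controls and not high_resolution_output:
        # Controls are in place; any remaining findings are documentation gaps.
        return "Low"
    if not privacy_tested and not query_controls:
        return "Critical/High" if high_resolution_output else "Medium"
    # Mixed evidence, e.g. scores exposed but tested and throttled.
    return "Reviewer judgment"


# /predict-risk from variant 1: scores exposed, nothing tested or throttled.
print(membership_inference_severity(
    person_linked=True, high_resolution_output=True,
    privacy_tested=False, query_controls=False))

# /eligibility from variant 3: label only, nothing tested or throttled.
print(membership_inference_severity(
    person_linked=True, high_resolution_output=False,
    privacy_tested=False, query_controls=False))
```

Keyword-only arguments keep call sites readable and make it harder to swap two booleans by accident.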

Comparison to Other Tools

| Tool / Framework | Catches this? | Notes |
| --- | --- | --- |
| Semgrep | Partial | Can flag probability/logit/similarity-score fields in responses, but cannot prove membership-inference risk without model/data context. |
| CodeQL | Partial | Can trace sensitive identifiers or score fields to API responses, but privacy-attack adequacy needs domain evidence. |
| OWASP LLM02:2025 | Partial | Covers sensitive information disclosure and maps to training-data membership inference, but this skill needs concrete review gates for model and embedding APIs. |
| NIST AI RMF 1.0 | Partial | Recognizes AI privacy risks including inference attacks, but teams need specific operational evidence fields. |
| Manual privacy red-team review | Yes | Can test confidence, label-only, and embedding-score exposure under realistic query budgets. |

Overall Assessment

Strengths:

  • Strong lifecycle coverage across training data, prompts/completions, retention, memorization, EU AI Act requirements, and consent.
  • Practical grep guidance for prompt construction, PII filtering, retention controls, and memorization testing.
  • Good warning that embeddings are not anonymous and should follow source-data retention and deletion controls.

Needs improvement:

  • Add membership inference and model privacy attacks as a first-class evidence gate, separate from memorization.
  • Require output-granularity review for logits, confidence values, top-k probabilities, similarity scores, document IDs, and debug nearest-neighbor traces.
  • Add label-only residual-risk handling so confidence masking is not treated as a complete defense.
  • Add Not Applicable / Not Evaluable outcomes so reviewers do not over-report synthetic-data deployments or under-report untestable provider claims.

Priority recommendations:

  1. Add a Membership Inference & Privacy Attack Surface subsection after memorization risk assessment.
  2. Add report fields for output granularity, query budget controls, MI evaluation status, embedding-score exposure, mitigation evidence, and Not Applicable rationale.
  3. Add severity rules for sensitive data plus confidence/logit/similarity exposure with no MI evidence.
  4. Add search hints for predict_proba, logits, confidence, top_k, similarity_search_with_score, score_threshold, nearest, embedding, epsilon, delta, differential_privacy, and rate_limit.
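
The search hints in recommendation 4 could be applied with a simple line scanner like the sketch below. It uses plain substring matching, so it will also flag unrelated identifiers (for example, `delta` inside `timedelta`); hits are leads for manual review, not findings.

```python
import re
from typing import List, Tuple

SEARCH_HINTS = [
    "predict_proba", "logits", "confidence", "top_k",
    "similarity_search_with_score", "score_threshold", "nearest",
    "embedding", "epsilon", "delta", "differential_privacy", "rate_limit",
]

# Longest hints first so the regex prefers the most specific match.
HINT_RE = re.compile(
    "|".join(re.escape(h) for h in sorted(SEARCH_HINTS, key=len, reverse=True))
)


def find_hints(source: str) -> List[Tuple[int, str]]:
    """Return (line_number, hint) pairs for every hint found in the source text."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in HINT_RE.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


sample = (
    "probs = model.predict_proba(x)\n"
    'return {"confidence": float(max(probs))}\n'
)
print(find_hints(sample))
```

In a real review the same function could be run over each file under the service's source tree, with the line numbers feeding the output-granularity field of the report.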

Sources checked:

Bounty Info

  • I have read and agree to the CONTRIBUTING.md bounty terms
  • Preferred payment method: Payment details can be provided privately after maintainer acceptance.
