PII-Safe CLI — Batch Dataset PII Sanitization

A standalone CLI tool for batch sanitization of historical datasets before they are fed into LLM pipelines. Complements the real-time PII-Safe FastAPI middleware with an offline batch processing capability.

Install

pip install .
# or in editable/dev mode:
pip install -e ".[dev]"

This registers the pii-safe command globally.

Usage

Sanitize a single file

pii-safe sanitize --input report.csv --mode pseudonymize

Sanitize a directory (recursive)

pii-safe sanitize --input ./logs/ --output ./clean/ --mode redact

Use a policy file

pii-safe sanitize --input ./logs/ --policy policy.yaml --output ./clean/

Dry-run scan (no files modified)

pii-safe scan --input ./data/

Filter entity types

pii-safe sanitize --input data.json --mode redact -e EMAIL -e SSN

CSV audit report

pii-safe sanitize --input ./data/ --mode pseudonymize --audit-format csv

Redaction Modes

Mode	Behaviour	Example output
`redact`	Replace PII with a static label	`[EMAIL]`, `[SSN]`
`pseudonymize`	Replace with a session token (reversible via audit map)	`EMAIL_01`, `SSN_02`
`block`	Refuse to output the entire record if PII is detected	`[RECORD BLOCKED — PII DETECTED]`

Supported Formats

Extension	Processor
`.csv`	Header preserved; each cell scanned independently
`.json`	Recursive walk of all string values; non-strings (int, bool) untouched
`.txt`, `.log`, `.text`	Line-by-line scanning
Any other extension	Treated as plain text

Built-in Entity Types

EMAIL · SSN · PHONE · CREDIT_CARD · IP_ADDR · DATE · ZIP_CODE · URL

Add custom patterns in policy.yaml:

extra_patterns:
  EMPLOYEE_ID: "EMP-\\d{6}"

Policy File

# policy.yaml
mode: pseudonymize
entity_types:
  - EMAIL
  - SSN
output_format: json        # audit report format: json | csv
extra_patterns:
  EMPLOYEE_ID: "EMP-\\d{6}"
api_url: http://localhost:8000   # optional: delegate to FastAPI backend

Audit Report

Every run produces a downloadable audit report alongside the sanitized output:

out/
├── data_sanitized.csv
└── audit_report.json      ← interceptions: entity type, hash, placeholder, line

The report never stores raw PII — only SHA-256 hashes of intercepted values.

JSON report structure:

{
  "generated_at": "2026-04-18T...",
  "mode": "pseudonymize",
  "summary": {
    "total_interceptions": 42,
    "by_entity_type": {"EMAIL": 20, "SSN": 22},
    "by_file": {"data.csv": 42}
  },
  "events": [...]
}

FastAPI Backend Integration

If the PII-Safe Docker stack (Issue #7) is running, add api_url to your policy file to delegate detection to the backend instead of the local regex engine:

api_url: http://localhost:8000

File Structure

pii_safe_cli/
├── __init__.py
├── cli.py          # Click commands: sanitize, scan
├── detector.py     # PIIDetector — regex engine + custom patterns
├── redactor.py     # Redactor — redact / pseudonymize / block
├── processors.py   # Format handlers: CSV, JSON, plain text
├── audit.py        # Audit report generator (JSON + CSV)
└── policy.py       # YAML policy loader
tests/
└── test_cli.py     # 56 tests across all modules
policy.yaml         # Example policy file
pyproject.toml      # Package config + CLI entrypoint

Running Tests

pytest tests/ -v

56 tests, all passing across: detector, redactor (all 3 modes), file processors, audit report, policy loader, CLI commands.

License

Apache 2.0 — consistent with the parent PII-Safe project.

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
.gitignore		.gitignore
Audit.py		Audit.py
Cli.py		Cli.py
Detector.py		Detector.py
Policy.py		Policy.py
Policy.yaml		Policy.yaml
Processors.py		Processors.py
Pyproject.toml		Pyproject.toml
README.md		README.md
Redactor.py		Redactor.py
Test cli.py		Test cli.py
init.py		init.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PII-Safe CLI — Batch Dataset PII Sanitization

Install

Usage

Sanitize a single file

Sanitize a directory (recursive)

Use a policy file

Dry-run scan (no files modified)

Filter entity types

CSV audit report

Redaction Modes

Supported Formats

Built-in Entity Types

Policy File

Audit Report

FastAPI Backend Integration

File Structure

Running Tests

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

PII-Safe CLI — Batch Dataset PII Sanitization

Install

Usage

Sanitize a single file

Sanitize a directory (recursive)

Use a policy file

Dry-run scan (no files modified)

Filter entity types

CSV audit report

Redaction Modes

Supported Formats

Built-in Entity Types

Policy File

Audit Report

FastAPI Backend Integration

File Structure

Running Tests

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages