igors93/logprivacy

LogPrivacy

CI · PyPI · Python versions · status: beta · typing: typed · dependencies: zero · license: MIT


For beginners: Every developer has accidentally logged a password, API key, or email address. LogPrivacy adds one import to your project and prevents those accidents automatically — no infrastructure changes, no configuration required.

For advanced teams: LogPrivacy implements a three-phase redaction pipeline (scan → resolve overlaps → redact) with configurable policy composition, HMAC pseudonymization, declarative JSON policies, structured traversal with depth/item budgets, and fail-closed semantics throughout. Zero runtime dependencies. Fully typed.

Works on Linux, macOS, and Windows · Python 3.10–3.14 · Zero third-party dependencies


The problem in two lines

# What happens WITHOUT LogPrivacy
logger.warning("Login failed for user=%s token=%s", email, api_token)
# → WARNING: Login failed for user=john@company.com token=sk_live_abc123XYZ

# What happens WITH LogPrivacy
from logprivacy import get_safe_logger

logger = get_safe_logger(__name__)
logger.warning("Login failed for user=%s token=%s", email, api_token)
# → WARNING: Login failed for user=[EMAIL] token=[SECRET]

How It Works

Every value you pass to LogPrivacy goes through a three-phase pipeline before being returned to you. Understanding this makes it easier to predict behavior, write custom rules, and tune performance.

flowchart LR
    IN["Input\nstr · dict · list\nLogRecord · bytes"]

    subgraph PIPE ["  Redaction Pipeline  "]
        direction LR
        S["Scanner\nfinds all candidate\nmatches — O(n) per rule"]
        R["Resolver\ndrops overlapping matches\nlongest-match wins"]
        D["Redactor\napplies masking strategy\nper category & policy"]
    end

    subgraph CFG ["Policy & Rules"]
        direction TB
        RU["Rule Registry\nemail · credential · token\nbearer · API key · URL\ncredit card · IP · phone"]
        PO["CleanerPolicy\nmasking · limits\nsensitive_keys\nblock_categories\nfield_rules · path_rules"]
    end

    OUT["Safe Output\ncleaned value\n+ RedactionResult"]

    IN --> S
    S --> R
    R --> D
    D --> OUT
    RU --> S
    PO --> S
    PO --> D

Scanner — Each rule runs finditer (or find_limited) over the text and reports candidate matches with their category, start, and end positions.

Resolver — When two rules match overlapping spans (e.g. an email inside a URL), the resolver keeps the longest non-overlapping match. Ties are broken by rule order.

Redactor — Applies the configured masking strategy (placeholder, partial, or hash) for each winning match, and assembles the final sanitized string.

Structured values (dicts, lists, dataclasses, exceptions) are traversed recursively. The same pipeline runs on every leaf string, with configurable depth and item budgets.
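
To make the three phases concrete, here is a minimal standard-library sketch of scan, resolve, and redact. It is not LogPrivacy's implementation: the rule patterns, the Match class, and the clean_text name are assumptions made for illustration, and only the placeholder strategy is shown.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    category: str
    start: int
    end: int


# Hypothetical rules for illustration; the library's real patterns differ.
RULES = [
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("SECRET", re.compile(r"(?<=password=)\S+")),
    ("URL", re.compile(r"https?://\S+")),
]


def scan(text):
    """Phase 1: collect every candidate match from every rule."""
    return [
        Match(category, m.start(), m.end())
        for category, pattern in RULES
        for m in pattern.finditer(text)
    ]


def resolve(matches):
    """Phase 2: among overlapping candidates, keep the longest one."""
    # sorted() is stable, so equal-length candidates keep rule order.
    longest_first = sorted(matches, key=lambda m: m.start - m.end)
    kept = []
    for cand in longest_first:
        if all(cand.end <= k.start or cand.start >= k.end for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda m: m.start)


def redact(text, matches):
    """Phase 3: replace each winning span with a placeholder."""
    parts, cursor = [], 0
    for m in matches:
        parts.append(text[cursor:m.start])
        parts.append(f"[{m.category}]")
        cursor = m.end
    parts.append(text[cursor:])
    return "".join(parts)


def clean_text(text):
    return redact(text, resolve(scan(text)))


print(clean_text("Login failed for john@example.com with password=123456"))
# Login failed for [EMAIL] with password=[SECRET]
print(clean_text("see https://x.test/?u=a@b.co now"))
# see [URL] now
```

In the second call the email sits inside the URL, so the resolver drops it and the longer URL match wins.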

From log record to safe output

The sequence below shows exactly how a logger.warning() call flows through the redaction filter before reaching any handler.

sequenceDiagram
    actor App as Application Code
    participant L  as Python Logger
    participant F  as LogPrivacyFilter
    participant E  as Redaction Engine
    participant H  as Log Handler

    App->>L: logger.warning("token=%s", tok)
    L->>F: filter(LogRecord)
    activate F
    Note over F: renders message + args<br/>hardens extra fields
    F->>E: clean(rendered_message)
    activate E
    E-->>F: "token=[SECRET]"
    deactivate E
    Note over F: clears record.args<br/>replaces exc_info if present
    F-->>L: True  (record modified in-place)
    deactivate F
    L->>H: emit(safe_record)
    H-->>App: [WARNING] token=[SECRET]

The filter modifies the LogRecord in place and clears .args so the original sensitive values cannot be recovered downstream by any handler.
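
The same idea can be sketched with nothing but the standard logging module. The RedactingFilter class and its single pattern below are hypothetical stand-ins, not LogPrivacyFilter itself; the point is the order of operations: render the message, redact it, then clear the args.

```python
import io
import logging
import re

# Hypothetical pattern for illustration only.
_SECRET = re.compile(r"(?<=token=)\S+")


class RedactingFilter(logging.Filter):
    def filter(self, record):
        rendered = record.getMessage()        # merges msg % args
        record.msg = _SECRET.sub("[SECRET]", rendered)
        record.args = None                    # raw values are no longer attached
        return True                           # keep the (now modified) record


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))

demo_logger = logging.getLogger("redaction_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False
demo_logger.addHandler(handler)
demo_logger.addFilter(RedactingFilter())

demo_logger.warning("Login failed token=%s", "sk_live_abc123XYZ")
output = stream.getvalue()
print(output, end="")
# WARNING: Login failed token=[SECRET]
```

Because logger-level filters run before any handler is called, every handler attached to this logger receives the already-redacted record. Note that in standard logging a filter attached to a logger only sees records created through that logger, not records propagated up from child loggers.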

What gets detected

pie title Detection categories — default policy
    "Credentials & passwords" : 25
    "Emails" : 15
    "Tokens & JWTs" : 20
    "API keys & secrets" : 20
    "URLs with sensitive params" : 10
    "Credit cards (Luhn-validated)" : 10

Strict policy additionally detects IP addresses and phone numbers. See Policies.


Installation

pip install logprivacy

Requires Python 3.10 or later. No other dependencies.


Which API should I use?

| I want to… | Use |
| --- | --- |
| Clean a string, dict, list, or any value | clean() |
| Print safely during debugging | safe_print() |
| Use Python's logging module safely | get_safe_logger() |
| Inspect what would be redacted without modifying | audit() |
| Fail a test when a log message leaks a secret | assert_clean() |
| Sanitize a URL while keeping safe query params | clean_url() |
| Scan or clean an old log file | scan_file() / clean_file() |
| Stream or clean a JSONL file | scan_jsonl() / clean_jsonl() |
| Get full result metadata alongside cleaned output | clean_with_result() / to_safe_data_with_result() |
| Serialize structured data safely for JSON / APIs | to_safe_data() / safe_json_dumps() |

See docs/guides/which-api.md for a longer decision guide.


Quick Start

from logprivacy import clean

message = "Login failed for john@example.com with password=123456"
print(clean(message))
# Login failed for [EMAIL] with password=[SECRET]

clean() accepts strings, dicts, lists, tuples, and bytes. The return type matches the input type. For dataclasses, exceptions, and custom objects use to_safe_data().


Safe Print

A drop-in replacement for print() during debugging. Detected sensitive values are redacted before the text reaches your terminal or captured output:

from logprivacy import safe_print

user = {"email": "john@example.com", "token": "sk_live_abc123", "status": "active"}
safe_print("Payload:", user)
# Payload: {'email': '[EMAIL]', 'token': '[SECRET]', 'status': 'active'}

Supports all print() arguments (sep, end, file, flush).


Safe Logger

Wraps any Python logging.Logger with a redaction filter. Your logging setup stays unchanged — you only swap how you get the logger:

import logging
from logprivacy import get_safe_logger

logging.basicConfig(level=logging.INFO)
logger = get_safe_logger(__name__)

logger.warning("User john@example.com used password=123456")
# WARNING:__main__:User [EMAIL] used password=[SECRET]

# Works with structured extra fields too
logger.info("Request complete", extra={"auth_token": "Bearer abc123xyz"})
# extra fields are cleaned before any handler sees the record

The filter is attached once per logger name. Calling get_safe_logger() again on the same name reuses the existing filter without creating a duplicate.

Exception tracebacks are also sanitized — if an exception message contains a secret, it is cleaned before any handler formats the record.


Audit Before Logging

Inspect what would be redacted without modifying the input. Useful for routing logic, metrics, or conditional alerting:

from logprivacy import audit

report = audit({"password": "123456", "email": "john@example.com"})

print(report.safe)          # False
print(report.risk_level)    # "high"
print(report.categories)    # ("credential", "email")
print(report.describe())
# Sensitive data detected at ['password', 'email']:
#   [credential] at path password — sensitive key
#   [email]      at path email    — matched email pattern

audit() traverses dicts, lists, and tuples recursively. Sensitive dictionary keys (like "password", "api_key", "secret") are always reported as credential findings even when the value does not match any text pattern.
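
The key-based behavior can be illustrated with a small read-only traversal. This is a sketch under assumptions, not the library's audit(): the find_sensitive function, the key list, and the email pattern are all hypothetical.

```python
import re

# Hypothetical key list and pattern, chosen for illustration.
SENSITIVE_KEYS = {"password", "api_key", "secret"}
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def find_sensitive(value, path="root"):
    """Return (category, path, reason) tuples without modifying the input."""
    findings = []
    if isinstance(value, dict):
        for key, item in value.items():
            child = str(key) if path == "root" else f"{path}.{key}"
            if str(key).lower() in SENSITIVE_KEYS:
                # Reported because of the key alone; the value is not inspected.
                findings.append(("credential", child, "sensitive key"))
            else:
                findings.extend(find_sensitive(item, child))
    elif isinstance(value, (list, tuple)):
        for index, item in enumerate(value):
            child = str(index) if path == "root" else f"{path}.{index}"
            findings.extend(find_sensitive(item, child))
    elif isinstance(value, str) and EMAIL.search(value):
        findings.append(("email", path, "matched email pattern"))
    return findings


print(find_sensitive({"password": "", "email": "john@example.com"}))
# [('credential', 'password', 'sensitive key'), ('email', 'email', 'matched email pattern')]
```

The empty password value is still reported, because the finding is driven by the key name rather than by a text pattern.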


Fail Tests When Logs Are Unsafe

Integrate LogPrivacy into your test suite to make sensitive leaks a test failure, not a production incident:

from logprivacy import assert_clean

def test_log_message_has_no_sensitive_data():
    assert_clean("operation finished successfully")  # passes

def test_response_dict_is_safe():
    assert_clean({"username": "john", "status": "active"})  # passes

def test_catches_accidental_leak():
    assert_clean("sent to john@company.com with token=abc123")
    # raises LogPrivacyAssertionError:
    # Sensitive data found in 2 location(s):
    #   [email]      at root — matched email pattern
    #   [credential] at root — matched secret pattern

assert_clean() raises LogPrivacyAssertionError with a human-readable description of every finding, including its path in nested structures.


Clean Structured Data

from logprivacy import clean

payload = {
    "user": {
        "email": "john@example.com",
        "password": "s3cr3t",
    },
    "metadata": {
        "request_id": "req-abc123",
        "status": "failed",
    },
}

print(clean(payload))
# {
#   'user': {'email': '[EMAIL]', 'password': '[SECRET]'},
#   'metadata': {'request_id': 'req-abc123', 'status': 'failed'}
# }

Nested dicts and lists are traversed recursively up to a configurable depth limit (default: 20). Sensitive dictionary keys (password, api_key, secret, etc.) are redacted even when the value is empty or does not match a regex pattern. For dataclasses and exceptions, use to_safe_data().
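
A depth budget of this kind can be sketched as follows. The clean_nested function, the key list, and the [DEPTH_LIMIT] placeholder are assumptions for illustration; text-pattern redaction of leaf strings is left out to keep the focus on the traversal.

```python
# Hypothetical key list; the library's defaults are not reproduced here.
SENSITIVE_KEYS = {"password", "api_key", "secret"}


def clean_nested(value, max_depth=20, _depth=0):
    if _depth > max_depth:
        # Fail closed: past the budget, return a placeholder, not the value.
        return "[DEPTH_LIMIT]"
    if isinstance(value, dict):
        return {
            key: "[SECRET]"
            if str(key).lower() in SENSITIVE_KEYS
            else clean_nested(item, max_depth, _depth + 1)
            for key, item in value.items()
        }
    if isinstance(value, (list, tuple)):
        # Rebuild the same container type (plain list or tuple).
        return type(value)(
            clean_nested(item, max_depth, _depth + 1) for item in value
        )
    return value  # leaf values pass through in this sketch


print(clean_nested({"user": {"password": "", "name": "john"}}))
# {'user': {'password': '[SECRET]', 'name': 'john'}}
print(clean_nested({"a": {"b": {"c": 1}}}, max_depth=1))
# {'a': {'b': '[DEPTH_LIMIT]'}}
```

Returning a placeholder at the limit, rather than the untouched subtree, is what keeps deeply nested secrets from slipping through unexamined.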


Structured and JSON-safe Data

Use to_safe_data() when the output must be safe to pass to JSON encoders or external APIs. It converts supported Python types recursively and fails closed for unsupported objects — no sensitive data leaks via repr() or str():

from logprivacy import (
    AdapterRegistry,
    CleanerPolicy,
    FieldRule,
    safe_json_dumps,
    to_safe_data,
    to_safe_data_with_result,
)

# Basic usage
to_safe_data({"email": "john@example.com", "password": "123"})
# {"email": "[EMAIL]", "password": "[SECRET]"}

# Direct JSON serialization
safe_json_dumps({"token": "abc123456789"})
# '{"token": "[SECRET]"}'

# Custom type adapter — teach LogPrivacy how to convert your domain objects
class Request:
    def __init__(self, identifier: str, token: str) -> None:
        self.identifier = identifier
        self.token = token

adapters = AdapterRegistry.default()
adapters.register(Request, lambda v: {"id": v.identifier, "token": v.token})
to_safe_data(Request("req-1", "abc123456789"), adapters=adapters)
# {"id": "req-1", "token": "[SECRET]"}

# Field-level rules — fine-grained control per field name
policy = CleanerPolicy.default().add_field_rules(
    FieldRule.exact("raw_body", action="truncate", max_chars=500),
    FieldRule.contains("secret", action="remove"),
)

# Completeness metadata — know when output is partial
result = to_safe_data_with_result({"token": "abc", "name": "Alice"})
print(result.complete)        # True  — all fields processed
print(result.stats.masked)    # 1     — one value was masked
print(result.stats.removed)   # 0

See docs/data/structured-data.md for supported types, field-rule actions, adapters, and JSON serialization details.
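
Failing closed on unsupported objects is worth seeing in isolation. The sketch below is not to_safe_data(): to_json_safe, the plain-dict adapter table, and the [UNSUPPORTED] placeholder are hypothetical, and redaction is omitted. It only shows why the converter never falls back to repr().

```python
import json


class Opaque:
    """A type the converter has no adapter for."""

    def __repr__(self):
        return "Opaque(token='abc123456789')"  # repr() would leak the token


def to_json_safe(value, adapters=None):
    adapters = adapters or {}
    if value is None or isinstance(value, (bool, int, float, str)):
        return value
    if isinstance(value, dict):
        return {str(k): to_json_safe(v, adapters) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_json_safe(v, adapters) for v in value]
    adapter = adapters.get(type(value))
    if adapter is not None:
        # The adapter output is converted again, so it cannot smuggle
        # unsupported objects into the result.
        return to_json_safe(adapter(value), adapters)
    return "[UNSUPPORTED]"  # hypothetical placeholder; never repr(value)


print(json.dumps(to_json_safe({"obj": Opaque()})))
# {"obj": "[UNSUPPORTED]"}
print(json.dumps(to_json_safe({"obj": Opaque()}, {Opaque: lambda v: {"kind": "opaque"}})))
# {"obj": {"kind": "opaque"}}
```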


JSONL Streaming

Process JSONL (newline-delimited JSON) files line by line without loading the entire file into memory:

from logprivacy import scan_jsonl, clean_jsonl, iter_safe_jsonl, safe_jsonl_write

# Scan for findings without modifying the file
for record in scan_jsonl("app.jsonl"):
    print(f"line {record.line_number}: {len(record.findings)} finding(s)")

# Clean atomically — writes to a temp file, then os.replace()
# The original is never partially overwritten on failure
clean_jsonl("app.jsonl", output="app.clean.jsonl")

# Stream cleaned records one at a time (memory-efficient)
# forward_to_downstream is a placeholder for your own consumer function
for record in iter_safe_jsonl("app.jsonl"):
    forward_to_downstream(record)

# Write clean records directly
with open("output.jsonl", "w") as f:
    safe_jsonl_write([{"email": "john@example.com", "status": "ok"}], f)
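
The "temp file, then os.replace()" pattern mentioned in the comments above is a general technique. A minimal sketch follows, assuming a hypothetical atomic_rewrite helper and a per-line transform callback; it is not the code behind clean_jsonl().

```python
import os
import tempfile


def atomic_rewrite(path, transform, output=None):
    """Write transform(line) for each line to a temp file, then swap it in."""
    target = output or path
    directory = os.path.dirname(os.path.abspath(target))
    # Create the temp file next to the target so os.replace() stays on one
    # filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp, \
                open(path, encoding="utf-8") as src:
            for line in src:
                tmp.write(transform(line))
        os.replace(tmp_path, target)
    except BaseException:
        os.unlink(tmp_path)  # leave no partial output behind
        raise
```

If transform raises halfway through, the target file is untouched and the temporary file is removed, so readers never observe a half-written result.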

Clean URLs

Sanitize sensitive query parameters while keeping safe context readable:

from logprivacy import clean_url

url = "https://api.example.com/users?page=1&token=abc123&email=john@example.com"
print(clean_url(url))
# https://api.example.com/users?page=1&token=[SECRET]&email=[EMAIL]

Safe parameters like page, sort, and limit are preserved unchanged. Sensitive ones — token, api_key, email, password, and any key that matches sensitive_keys in your policy — are replaced with placeholders.
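
For intuition, query-parameter redaction can be approximated with urllib.parse alone. The redact_query function and its parameter table are assumptions for this sketch; clean_url() itself consults your policy's sensitive_keys and may behave differently.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical mapping from sensitive parameter name to placeholder.
SENSITIVE_PARAMS = {
    "token": "[SECRET]",
    "api_key": "[SECRET]",
    "password": "[SECRET]",
    "email": "[EMAIL]",
}


def redact_query(url):
    parts = urlsplit(url)
    pairs = [
        (key, SENSITIVE_PARAMS.get(key.lower(), value))
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    # safe="[]" keeps the placeholder brackets readable instead of %5B/%5D.
    return urlunsplit(parts._replace(query=urlencode(pairs, safe="[]")))


print(redact_query(
    "https://api.example.com/users?page=1&token=abc123&email=john@example.com"
))
# https://api.example.com/users?page=1&token=[SECRET]&email=[EMAIL]
```

Parsing the URL into components first means only the query string is rewritten; the scheme, host, and path are passed through unchanged.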


Masking Styles

Three built-in strategies are available. Choose based on what you need to preserve:

from logprivacy import Cleaner, CleanerPolicy

# Placeholder (default) — maximum privacy, minimum context
Cleaner(CleanerPolicy.default(masking="placeholder"))

# Partial — shows prefix/suffix, useful for correlation without exposure
Cleaner(CleanerPolicy.default(masking="partial"))

# Hash — stable opaque token, same input always gives same output
Cleaner(CleanerPolicy.default(masking="hash"))

| Input | placeholder | partial | hash |
| --- | --- | --- | --- |
| john@example.com | [EMAIL] | j***@example.com | [EMAIL:855f96e9] |
| sk_live_abcdef123456 | [SECRET] | sk_l********3456 | [SECRET:3c6e0b8a] |
| Bearer eyJhbGci... | [TOKEN] | [TOKEN] | [TOKEN:7f4a1b2c] |

Hash pseudonymization

For cases where you need stable, deterministic tokens without exposing the original value — compliance logging, analytics across services, A/B testing:

from logprivacy import HashMaskingStrategy, CleanerPolicy, clean

policy = CleanerPolicy.default().with_masking(
    HashMaskingStrategy(key=b"your-32-byte-minimum-secret-key!")
)
clean("john@example.com", policy=policy)
# → [EMAIL:855f96e9f4e27c0b]

# Same input + same key = same token, always
# Different key = entirely different tokens (key rotation)

This is pseudonymization, not anonymization: a party with the key can re-derive any token from the original value. The key is never exposed through repr(), str(), serialization, or exception messages.
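
The properties described here (stable per key, different per key, re-derivable by anyone holding the key) follow directly from keyed hashing. The sketch below uses the standard hmac module; the pseudonymize function, the choice of SHA-256, and the 16-character truncation are assumptions, so its tokens will not necessarily match the ones LogPrivacy emits.

```python
import hashlib
import hmac


def pseudonymize(value, key, category="EMAIL", digits=16):
    # Assumption: HMAC-SHA256, truncated for readability.
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"[{category}:{digest[:digits]}]"


key_a = b"your-32-byte-minimum-secret-key!"
key_b = b"a-different-32-byte-secret-key!!"

token_1 = pseudonymize("john@example.com", key_a)
token_2 = pseudonymize("john@example.com", key_a)
token_3 = pseudonymize("john@example.com", key_b)

print(token_1 == token_2)  # True: same input and key give the same token
print(token_1 == token_3)  # False: rotating the key changes every token
```

A plain unkeyed hash would also be deterministic, but anyone could confirm a guessed email by hashing it; the secret key is what restricts that ability to key holders.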

For field-level pseudonymization in structured data, use HMACMaskingStrategy with with_pseudonymizer() and a PathRule with action="pseudonymize" — see Path Rules and Pseudonymization.


Policies

LogPrivacy ships four ready-made policies. All are fully composable — you can extend any of them with custom rules or field/path rules.

graph LR
    D["default\nemail · credential\ntoken · API key\nURL params · credit card"]
    S["strict\n+ IP address\n+ phone number"]
    W["web\nURL-focused variant\n(no credit cards)"]
    P["production\nstrict + raises\nLogBlockedError on\nhigh-risk categories"]

    D -- "extends" --> S
    D -- "variant" --> W
    S -- "adds enforcement" --> P

| Policy | Active rules | Use when |
| --- | --- | --- |
| CleanerPolicy.default() | email, credentials, tokens, API keys, URLs, credit cards | General-purpose — safe default for any project |
| CleanerPolicy.strict() | everything above + IP addresses + phone numbers | Healthcare, finance, high-sensitivity environments |
| CleanerPolicy.web() | URLs, credentials, tokens, secrets | HTTP access log processing |
| CleanerPolicy.production() | strict + raises LogBlockedError on high-risk | CI gates, production safety checks |

from logprivacy import Cleaner, CleanerPolicy

# Strict mode: also catches IP addresses and phone numbers
cleaner = Cleaner(CleanerPolicy.strict())

# Production mode: raises instead of masking on critical categories
# Ideal for CI pipelines or zero-tolerance environments
cleaner = Cleaner(CleanerPolicy.production())

See docs/core/policies.md for full details on each policy.


Path Rules and Pseudonymization

PathRule matches fields by their full traversal path and takes precedence over FieldRule and sensitive_keys. Use glob patterns to match lists and nested structures:

from logprivacy import CleanerPolicy, PathRule

policy = CleanerPolicy.default().add_path_rules(
    PathRule.exact("user.email",            action="mask"),
    PathRule.glob("orders.*.card_number",   action="remove"),
    PathRule.exact("debug.raw_body",        action="truncate", max_chars=200),
    PathRule.exact("auth.token",            action="block"),   # raises LogBlockedError
)

# Pseudonymize specific fields with HMAC
from logprivacy import HMACMaskingStrategy

policy = policy.with_pseudonymizer(HMACMaskingStrategy(key=b"..."))
policy = policy.add_path_rules(
    PathRule.exact("user.id", action="pseudonymize"),
)
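
To show how path matching and precedence might fit together, here is a small sketch. It assumes that * matches exactly one path segment and that the first matching path rule wins; the library's actual glob syntax is described in its docs and may differ. The path_matches and action_for helpers and both rule tables are hypothetical.

```python
def path_matches(pattern, path):
    """Segment-wise match; '*' stands for exactly one segment (assumption)."""
    pattern_parts, path_parts = pattern.split("."), path.split(".")
    return len(pattern_parts) == len(path_parts) and all(
        p == "*" or p == s for p, s in zip(pattern_parts, path_parts)
    )


# Hypothetical rule tables for illustration.
PATH_RULES = [("orders.*.card_number", "remove"), ("user.email", "mask")]
FIELD_RULES = {"card_number": "mask"}


def action_for(path):
    # Path rules are consulted first, so they take precedence over field rules.
    for pattern, action in PATH_RULES:
        if path_matches(pattern, path):
            return action
    return FIELD_RULES.get(path.rsplit(".", 1)[-1])


print(action_for("orders.0.card_number"))   # remove (path rule wins)
print(action_for("billing.card_number"))    # mask   (falls back to field rule)
print(action_for("metadata.status"))        # None   (no rule applies)
```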

Declarative JSON policies

Policies can be serialized to and from JSON, enabling configuration-driven deployments without code changes:

# Serialize
json_str = policy.to_json()

# Deserialize (e.g., load from a config file or environment variable)
policy2 = CleanerPolicy.from_json(json_str)

# Or from a dict (useful with YAML/TOML loaders)
policy3 = CleanerPolicy.from_dict({
    "schema_version": 1,
    "base": "strict",
    "masking": "hash",
    "sensitive_keys": ["internal_id", "trace_token"],
    "field_rules": [
        {"match": "raw_body", "mode": "exact", "action": "truncate", "max_chars": 500}
    ],
})

See docs/data/structured-data.md for path-rule glob syntax, precedence rules, and the allow_paths allowlist.


Clean Log Files

Scan or sanitize existing log files on disk:

from logprivacy import scan_file, clean_file

# Inspect without modifying
report = scan_file("app.log")
print(report.describe())

# Clean atomically (writes to temp file, then replaces)
clean_file("app.log", output="app.clean.log")

# In-place cleaning
clean_file("app.log")

CLI

# Scan a log file — prints a summary of findings
python -m logprivacy scan app.log

# Clean a log file — outputs sanitized copy
python -m logprivacy clean app.log --output app.clean.log

# Clean a single string inline
python -m logprivacy text "email=john@example.com password=s3cr3t"
# email=[EMAIL] password=[SECRET]

What It Protects Against

| Category | Example input | Output |
| --- | --- | --- |
| Email addresses | john@example.com | [EMAIL] |
| Passwords and credentials | password=123456 | password=[SECRET] |
| API keys and access tokens | api_key=sk_live_abc | api_key=[SECRET] |
| Bearer tokens and JWTs | Authorization: Bearer eyJ... | [TOKEN] |
| Generic secrets | secret=abc123456789 | [SECRET] |
| Sensitive URL query params | ?token=abc123 | ?token=[SECRET] |
| Credit card numbers (Luhn) | 4111111111111111 | [CREDIT_CARD] |
| IP addresses (strict mode) | 192.168.1.1 | [IP_ADDRESS] |
| Phone numbers (strict mode) | +1-800-555-0100 | [PHONE] |
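
Luhn validation is what separates a plausible card number from an arbitrary run of digits. A minimal sketch of the checksum follows; the luhn_valid name and the 12-digit minimum are assumptions, not the library's exact rule.

```python
def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c in "0123456789"]
    if len(digits) < 12:  # assumed lower bound for card-like numbers
        return False
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:      # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9      # same as summing the two digits
        total += digit
    return total % 10 == 0


print(luhn_valid("4111111111111111"))     # True
print(luhn_valid("4111111111111112"))     # False
print(luhn_valid("4111 1111 1111 1111"))  # True (separators are ignored)
```

Running the checksum after the regex match reduces false positives on order numbers, timestamps, and other long digit strings.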

Security Disclaimer

LogPrivacy reduces accidental sensitive-data exposure in logs. It is a safety net, not a DLP system.

  • Regex-based detection has false positives and false negatives. Novel secret formats, obfuscated values, or custom encodings may not be detected. Always review findings in your specific context.
  • Avoid logging sensitive data in the first place. LogPrivacy is the second control, not the first. Structure your code so secrets never reach log calls.
  • It does not replace secret management, encryption, access control, or legal privacy review. Compliance with GDPR, HIPAA, or PCI-DSS requires a legal assessment that goes beyond log redaction.
  • CleanerPolicy.production() turns silent leaks into loud failures — use it as a third control in CI and production to catch regressions early.

See docs/security/security-model.md for the full security model and threat boundaries.


Development

Setup

Linux / macOS:

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install -e ".[dev]"

Windows (PowerShell):

python -m venv .venv
.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
python -m pip install -e ".[dev]"

Run all checks

./scripts/ci.sh

Individual commands

python -m ruff format .          # format code
python -m ruff check . --fix     # lint and auto-fix
python -m ruff format --check .  # verify formatting
python -m ruff check .           # lint only
python -m mypy src               # type check
python -m pytest -v              # run tests (742 tests)
python -m build                  # build distribution

CI matrix

| OS | Python versions |
| --- | --- |
| Linux (ubuntu-latest) | 3.10, 3.11, 3.12, 3.13, 3.14 |
| macOS (macos-latest) | 3.10, 3.11, 3.12, 3.13, 3.14 |
| Windows (windows-latest) | 3.10, 3.11, 3.12, 3.13, 3.14 |

Design Goals

  1. Simple things should be simple: pip install logprivacy plus one import is enough to get started.
  2. Advanced usage should be composable — policies, rules, strategies, and path rules all layer cleanly.
  3. Logs should be safe by default — sensitive keys are redacted even without regex matches.
  4. Fail closed, not open — when in doubt (unsupported type, iteration error, depth limit), return a safe placeholder rather than the original value.
  5. Output should be predictable and explainable — every finding has a category, location, and reason.
  6. Runtime dependencies stay at zero — no third-party packages, ever.
  7. You should not need to replace your logging setup — the filter attaches to existing loggers.
  8. Security guidance should be honest — this library reduces risk, it does not replace a DLP system.

Status

Beta. The core API (clean, audit, assert_clean, safe_print, get_safe_logger, clean_url) is stable. Advanced features (path rules, JSONL, HMAC pseudonymization, declarative policies) are in active use.

See CHANGELOG.md for the full history.

About

Privacy-first log redaction for Python. Prevent secrets and PII from leaking into logs.
