A Ruby gem with a C extension for high-performance regex-based redaction of sensitive data from strings.
DataRedactor scans text for sensitive data — API keys and cloud secrets, IBANs,
credit cards, national IDs, emails, phone numbers, IPs, and more — and replaces
each match with a placeholder. The scanning runs in a C extension backed by POSIX
regex.h, so the heavy lifting happens outside the Ruby VM and stays fast enough
to run inline on large payloads.
It ships 88 built-in patterns across 15+ countries, grouped into tags
(:credentials, :financial, :contact, ...) so you can redact only what you
care about. Beyond plain strings it can walk nested Hashes, Arrays, and JSON,
audit a payload without mutating it (scan), and plug into Logger, Rails, and
Rack. You can also register your own patterns at boot.
- Log scrubbing: drop the `Logger` formatter in so no secret or PII ever reaches disk or your log aggregator.
- Rails parameter filtering: feed `filter_parameters` a redactor-backed proc to keep request params out of logs and error reports.
- HTTP request/response sanitising: Rack middleware scrubs response bodies and sensitive headers in flight.
- Sanitising LLM / API payloads: run `redact_deep` over a params hash or `redact_json` over a JSON body before it leaves the process.
- Compliance & auditing: `scan` reports every match with byte offsets, tag, and pattern name without changing the text, for false-positive tuning.
- Internal identifiers: register company-specific patterns (`add_pattern`) or generate them from a person's name (`name_pattern`).
require "data_redactor"
text = "User CF is RSSMRA85M01H501Z and key is AKIAIOSFODNN7EXAMPLE"
DataRedactor.redact(text)
# => "User CF is [REDACTED] and key is [REDACTED]"

only: and except: both accept a single value or an Array, mixing Symbols (tag names) and Strings (specific pattern names).
DataRedactor.tags
# => [:credentials, :financial, :tax_id, :national_id, :contact, :network, :travel, :other, :custom]
DataRedactor.pattern_names
# => ["aws_s3_presigned_url", "aws_access_key_id", "email", "phone_e164", "ipv4", ...]
# Tag-level filtering
DataRedactor.redact(text, only: [:credentials])
DataRedactor.redact(text, except: :contact)
# Single specific pattern
DataRedactor.redact(text, only: ["aws_access_key_id"])
# Mix — every credentials pattern PLUS aws_access_key_id (even if it lived in another tag)
DataRedactor.redact(text, only: [:credentials, "aws_access_key_id"])
# Combine — every contact pattern EXCEPT email
DataRedactor.redact(text, only: :contact, except: ["email"])

Precedence: a pattern is redacted if and only if (only: is nil OR the pattern matches only:) AND (the pattern does not match except:). except: always wins when the two overlap, so only: :contact, except: :contact is a no-op (everything is excluded).
Errors: an unknown tag Symbol raises DataRedactor::UnknownTagError; an unknown pattern name String raises DataRedactor::UnknownPatternError.
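The precedence rule can be sketched in plain Ruby. This is a minimal illustration, not the gem's implementation: the `PATTERNS` table and the helper names `selector_matches?` and `active_patterns` are hypothetical, and error handling for unknown tags or names is left out.

```ruby
# Hypothetical pattern table (name => tag), for illustration only.
PATTERNS = {
  "aws_access_key_id" => :credentials,
  "email"             => :contact,
  "phone_e164"        => :contact,
  "ipv4"              => :network
}.freeze

# Symbols select by tag, Strings select by pattern name.
def selector_matches?(selectors, name, tag)
  Array(selectors).any? { |s| s.is_a?(Symbol) ? s == tag : s == name }
end

# Active iff (only is nil OR matches only) AND (does not match except).
def active_patterns(only: nil, except: nil)
  selected = PATTERNS.select do |name, tag|
    included = only.nil? || selector_matches?(only, name, tag)
    excluded = !except.nil? && selector_matches?(except, name, tag)
    included && !excluded
  end
  selected.keys
end

active_patterns(only: :contact, except: ["email"])  # => ["phone_e164"]
active_patterns(only: :contact, except: :contact)   # => []
active_patterns(only: [:credentials, "email"])      # => ["aws_access_key_id", "email"]
```

Evaluating `except:` after `only:` is what makes the overlapping case collapse to an empty selection.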
By default every match is replaced with [REDACTED]. Use the placeholder: keyword to change this:
# Plain string — any replacement text
DataRedactor.redact(text, placeholder: "***")
DataRedactor.redact(text, placeholder: "")
# Tagged — embeds the pattern's tag name so you know what was redacted
DataRedactor.redact(text, placeholder: :tagged)
# "user@example.com" → "[REDACTED:CONTACT]"
# "AKIAIOSFODNN7EXAMPLE" → "[REDACTED:CREDENTIALS]"
# "DE89370400440532013000" → "[REDACTED:FINANCIAL]"
# Hash — deterministic 4-hex suffix of the matched value
# Same value always produces the same token — useful for correlating
# redactions across log lines without leaking the original.
DataRedactor.redact(text, placeholder: :hash)
# "user@example.com" → "[CONTACT_3d7a]"
# "user@example.com" → "[CONTACT_3d7a]" (same every time)
# "other@example.com" → "[CONTACT_91fc]" (different value, different hash)

All three modes compose with only: and except: filters:

DataRedactor.redact(text, only: :contact, placeholder: :tagged)

DataRedactor.scan returns every match alongside the redacted string, which is useful for auditing, tuning false positives, and compliance pipelines:
original_text = "User AKIAIOSFODNN7EXAMPLE logged in from 192.168.1.1"
result = DataRedactor.scan(original_text)
# => {
# redacted: "User [REDACTED] logged in from [REDACTED]",
# matches: [
# { tag: :credentials, name: "aws_access_key_id", value: "AKIAIOSFODNN7EXAMPLE", start: 5, length: 20 },
# { tag: :network, name: "ipv4", value: "192.168.1.1", start: 41, length: 11 }
# ]
# }
# :start and :length are byte offsets into the original string
m = result[:matches].first
original_text.byteslice(m[:start], m[:length]) # => "AKIAIOSFODNN7EXAMPLE"
# Accepts the same filters as redact (tags + specific pattern names)
DataRedactor.scan(text, only: :credentials)
DataRedactor.scan(text, except: :network)
DataRedactor.scan(text, only: :contact, except: ["email"])

redact_deep redacts every string value inside a nested Hash or Array, which is useful for params hashes, Sidekiq job payloads, webhook bodies, and anything that isn't a flat string:
# Hash — returns a deep copy, never mutates the input
result = DataRedactor.redact_deep({
"user" => { "email" => "alice@example.com" },
"count" => 3,
"tags" => ["admin", "alice@example.com"]
})
# => { "user" => { "email" => "[REDACTED]" }, "count" => 3, "tags" => ["admin", "[REDACTED]"] }
# Hash keys are never touched — only values are redacted
# Non-string scalars (Integer, Float, nil, Boolean) pass through unchanged
# Accepts the same filters as redact
DataRedactor.redact_deep(params, only: :credentials)
DataRedactor.redact_deep(payload, except: :network, placeholder: :tagged)

# JSON string: parse → redact_deep → re-serialise
safe_json = DataRedactor.redact_json('{"email":"alice@example.com","count":3}')
# => '{"email":"[REDACTED]","count":3}'
# Raises JSON::ParserError on invalid input
DataRedactor.redact_json("not json") # raises JSON::ParserError

Teams often have internal IDs that the gem can't ship. Register them at boot:
# String (POSIX ERE) or Regexp — both accepted
DataRedactor.add_pattern(name: "employee_id", regex: "EMP-[0-9]{6}")
DataRedactor.add_pattern(name: "ticket_ref", regex: /TICKET-[A-Z]{2}[0-9]{4}/, boundary: true)
# Custom patterns are tagged :custom by default; pass any built-in tag to group differently
DataRedactor.add_pattern(name: "internal_key", regex: "INT-[A-Z]{3}", tag: :credentials)
DataRedactor.redact(text) # runs all patterns including custom
DataRedactor.redact(text, only: [:custom]) # only user patterns
DataRedactor.redact(text, only: [:custom, :credentials]) # mix
DataRedactor.custom_patterns # => [{name:, source:, tag:, boundary:}, ...]
DataRedactor.remove_pattern("employee_id")
DataRedactor.clear_custom_patterns! # mostly for test suites

Regex rules: patterns must be POSIX ERE (the same engine used for built-ins). Not supported: \d, \s, \w, \b, lookahead/lookbehind, non-greedy quantifiers, named groups. Violations raise DataRedactor::InvalidPatternError at registration time, never at redaction time. Use [0-9] instead of \d, [[:space:]] instead of \s, etc.
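When porting an existing Ruby regex, the shorthand substitutions above can be applied mechanically. The helper below is a naive sketch under stated assumptions: `to_posix_ere` and `ERE_EQUIVALENTS` are hypothetical names, not part of the gem, and the rewrite does not handle shorthands inside bracket expressions, escaped backslashes, or any of the other unsupported constructs.

```ruby
# Hypothetical helper: rewrite three Perl-style shorthands into POSIX ERE
# equivalents. Anything else in the source string is left untouched.
ERE_EQUIVALENTS = {
  '\d' => '[0-9]',
  '\s' => '[[:space:]]',
  '\w' => '[0-9A-Za-z_]'
}.freeze

def to_posix_ere(source)
  source.gsub(/\\[dsw]/) { |shorthand| ERE_EQUIVALENTS.fetch(shorthand) }
end

to_posix_ere('EMP-\d{6}')     # => "EMP-[0-9]{6}"
to_posix_ere('key:\s*\w+')    # => "key:[[:space:]]*[0-9A-Za-z_]+"
```

The resulting string is what would be handed to `add_pattern` as `regex:`; registration is still the place where an invalid pattern is rejected.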
boundary: true — wraps the pattern with (^|[^0-9A-Za-z])(PATTERN)([^0-9A-Za-z]|$) so it only fires when the token is not embedded in a longer alphanumeric string. Incompatible with patterns that contain capture groups.
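The wrapper can be tried out with Ruby's own regex engine. This is a sketch of the idea rather than the C code: the helper names `wrap_boundary` and `redact_bounded` are used here for illustration, and Ruby's `^` and `$` anchor at line boundaries, which may differ from the extension's behaviour on multi-line input.

```ruby
# Wrap a pattern source as described above: group 1 is the leading
# boundary, group 2 the token, group 3 the trailing boundary.
def wrap_boundary(source)
  "(^|[^0-9A-Za-z])(#{source})([^0-9A-Za-z]|$)"
end

# Replace only group 2 and keep the boundary characters verbatim.
def redact_bounded(text, source, placeholder = "[REDACTED]")
  text.gsub(Regexp.new(wrap_boundary(source))) { "#{$1}#{placeholder}#{$3}" }
end

redact_bounded("ref TICKET-AB1234 closed", "TICKET-[A-Z]{2}[0-9]{4}")
# => "ref [REDACTED] closed"
redact_bounded("xTICKET-AB12345", "TICKET-[A-Z]{2}[0-9]{4}")
# => "xTICKET-AB12345" (embedded in a longer alphanumeric run, so no match)
```

A capture group inside the wrapped pattern would shift the numbering, so group 3 would no longer be the trailing boundary. That is one plausible reading of why `boundary: true` rejects patterns with capture groups.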
Personal names can't ship as built-ins — every team has different ones — but the regex
boilerplate to match a name across its written variations is the same every time.
name_pattern generates that regex for you, ready to hand to add_pattern:
DataRedactor.add_pattern(
name: "person_mario_rossi",
regex: DataRedactor.name_pattern("Mario", "Rossi"),
tag: :contact
)
DataRedactor.redact("ticket from Mario Rossi about ...")
# => "ticket from [REDACTED] about ..."

A single generated pattern matches all of these:

- Case: `Mario Rossi`, `mario rossi`, `MARIO ROSSI`
- Order: `Mario Rossi`, `Rossi Mario`, `Rossi, Mario`, `Rossi,Mario`
- Initials: `M. Rossi`, `M Rossi`, `Mario R.`, `M.R.`, `MR`
- Diacritics: `name_pattern("Jose", "Munoz")` also matches `José Muñoz` (and vice versa)
- Separators: spaces and hyphens are interchangeable. `name_pattern("Anne-Marie", "Berg")` matches `Anne-Marie Berg`, `Anne Marie Berg`, `AnneMarie Berg`, and each half alone (`Anne Berg`, `Marie Berg`). Multi-word parts like `"Van der Berg"` tolerate any space/hyphen separator between words.
It does not match a name embedded in a longer word — Mario will not fire inside
Mariolino — because the generated pattern is boundary-wrapped. For that reason, register
it with the default boundary: false (the wrapper is already baked into the returned
string; boundary: true would double-wrap and reject its capture groups).
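To make the shape of such a generated pattern concrete, here is a much-reduced sketch. It is not the gem's `name_pattern`: the helpers `any_case` and `simple_name_source` are hypothetical, the inputs are assumed to be plain ASCII letters, and only case, word order, and the first-name initial are covered (no diacritics, hyphens, or middle names).

```ruby
# Expand each letter into a two-case bracket expression: "Ma" -> "[Mm][Aa]".
# POSIX ERE has no inline case-insensitive flag, hence the explicit classes.
def any_case(word)
  word.each_char.map { |c| c =~ /[A-Za-z]/ ? "[#{c.upcase}#{c.downcase}]" : Regexp.escape(c) }.join
end

# Build a boundary-wrapped source with exactly three capture groups.
def simple_name_source(first, last)
  f = any_case(first)
  l = any_case(last)
  initial = any_case(first[0])
  variants = [
    "#{f} ?#{l}",             # Mario Rossi
    "#{initial}[.]? ?#{l}",   # M. Rossi, M Rossi
    "#{l},? ?#{f}"            # Rossi Mario / Rossi, Mario / Rossi,Mario
  ]
  "(^|[^0-9A-Za-z])(#{variants.join('|')})([^0-9A-Za-z]|$)"
end

pattern = Regexp.new(simple_name_source("Mario", "Rossi"))
"ticket from Mario Rossi about ...".sub(pattern) { "#{$1}[REDACTED]#{$3}" }
# => "ticket from [REDACTED] about ..."
pattern.match?("Mariolino Rossi")  # => false
```

Because the boundary groups are already part of the returned string, a source like this would be registered with the default `boundary: false`, matching the advice above.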
Pass middle: to also cover a middle name — both the no-middle and with-middle forms match:
DataRedactor.name_pattern("Mario", "Rossi", middle: "Luigi")
# matches "Mario Rossi" AND "Mario Luigi Rossi" AND "Rossi Mario Luigi"

Optional adapters for Logger, Rails, and Rack. None are loaded automatically: require only what you use, and the gem adds zero runtime dependencies in the gemspec.
Drop-in Logger::Formatter replacement that scrubs every emitted line:
require "data_redactor/integrations/logger"
logger = Logger.new($stdout)
logger.formatter = DataRedactor::Integrations::Logger.new
logger.info("Auth failed for alice@example.com")
# => I, [...] -- : Auth failed for [REDACTED]

The formatter wraps an inner formatter (defaults to Logger::Formatter), so it composes with structured loggers. It forwards only:, except:, and placeholder: to DataRedactor.redact. Exception messages and arbitrary objects are scrubbed too: the wrapped object is passed unchanged to the inner formatter so the exception cause chain is preserved, and only the rendered string is redacted.
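The wrapping approach can be illustrated with the standard library alone. This is a sketch, not the gem's adapter: `ScrubbingFormatter` is a hypothetical class, and `SCRUB` is a stand-in for `DataRedactor.redact` that only recognises email-like tokens.

```ruby
require "logger"
require "stringio"

# Hypothetical stand-in for DataRedactor.redact (emails only).
SCRUB = ->(line) { line.gsub(/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/, "[REDACTED]") }

# Delegates rendering to an inner formatter, then redacts the rendered string.
class ScrubbingFormatter
  def initialize(inner = Logger::Formatter.new, scrub: SCRUB)
    @inner = inner
    @scrub = scrub
  end

  # Logger calls its formatter with (severity, time, progname, msg).
  def call(severity, time, progname, msg)
    @scrub.call(@inner.call(severity, time, progname, msg))
  end
end

io = StringIO.new
logger = Logger.new(io)
logger.formatter = ScrubbingFormatter.new
logger.info("Auth failed for alice@example.com")
io.string  # the emitted line ends with "Auth failed for [REDACTED]\n"
```

Redacting after the inner formatter has rendered the line means the message object itself is never modified, which is consistent with the description above.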
# config/initializers/filter_parameter_logging.rb
require "data_redactor/integrations/rails"
Rails.application.config.filter_parameters += [
DataRedactor::Integrations::Rails.filter
]

DataRedactor::Integrations::Rails.filter returns a (key, value) proc compatible with Rails' parameter filter. String values are mutated in place via String#replace so Rails sees the redacted value. Non-strings are left alone. It accepts the same only:/except:/placeholder: kwargs.
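The in-place mutation is the part that is easy to get wrong, so here is a small sketch of a filter with that shape. It is an illustration under assumptions: `redact` is a stand-in for `DataRedactor.redact` that only knows one key format, and the frozen-string guard is an addition of this sketch rather than documented gem behaviour.

```ruby
# Hypothetical stand-in for DataRedactor.redact.
redact = ->(s) { s.gsub(/AKIA[0-9A-Z]{16}/, "[REDACTED]") }

# A two-argument filter receives (key, value). Returning a new string would
# be ignored by the caller; String#replace changes the object it already holds.
filter = lambda do |_key, value|
  value.replace(redact.call(value)) if value.is_a?(String) && !value.frozen?
end

params = { "note" => +"key is AKIAIOSFODNN7EXAMPLE", "count" => 3 }
params.each { |key, value| filter.call(key, value) }
params  # => {"note"=>"key is [REDACTED]", "count"=>3}
```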
# config.ru
require "data_redactor/integrations/rack"
use DataRedactor::Integrations::Rack, scrub: [:body, :headers]
run MyApp

scrub: selects which surfaces to redact (default [:body, :headers]):

- `:body`: buffers the response body, runs `DataRedactor.redact` over it, and returns it as a single chunk. Drops the `Content-Length` header so the server recomputes it (the redacted body may differ in byte length).
- `:headers`: scrubs sensitive response headers (`Set-Cookie`, `Authorization`, `X-Api-Key`, `X-Auth-Token`, `X-Access-Token`) in place, and sensitive request headers (`HTTP_AUTHORIZATION`, `HTTP_PROXY_AUTHORIZATION`, `HTTP_COOKIE`, `HTTP_X_API_KEY`, `HTTP_X_AUTH_TOKEN`, `HTTP_X_ACCESS_TOKEN`) in the env hash so any downstream middleware that logs them sees redacted values.

Pass a subset (e.g. `scrub: [:headers]`) to opt out of body wrapping. The middleware forwards only:/except:/placeholder: to DataRedactor.redact. Unknown surfaces raise ArgumentError at boot.

Body wrapping is buffering. The middleware reads the entire response body into memory before scanning. For streaming endpoints (SSE, large file downloads, Rack::Hijack) use `scrub: [:headers]` and rely on the Logger formatter for application logs instead.
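The buffering behaviour of `:body` can be sketched without the rack gem, since a middleware is just an object that responds to `call(env)`. This is an illustration, not the gem's adapter: `BufferingScrubber` is a hypothetical class and `REDACT` is a stand-in for `DataRedactor.redact`.

```ruby
# Hypothetical stand-in for DataRedactor.redact.
REDACT = ->(s) { s.gsub(/AKIA[0-9A-Z]{16}/, "[REDACTED]") }

# Buffers the whole body, redacts it, and drops Content-Length.
class BufferingScrubber
  def initialize(app, redact: REDACT)
    @app = app
    @redact = redact
  end

  def call(env)
    status, headers, body = @app.call(env)
    buffered = +""
    body.each { |chunk| buffered << chunk }  # entire body is held in memory here
    body.close if body.respond_to?(:close)
    headers = headers.reject { |name, _| name.casecmp?("content-length") }
    [status, headers, [@redact.call(buffered)]]
  end
end

# The secret straddles two chunks, so scanning chunk by chunk would miss it.
app = lambda do |_env|
  [200, { "content-type" => "text/plain", "content-length" => "24" }, ["key=AKIAIOSF", "ODNN7EXAMPLE"]]
end

status, headers, body = BufferingScrubber.new(app).call({})
body     # => ["key=[REDACTED]"]
headers  # => {"content-type"=>"text/plain"}
```

The two-chunk body shows one reason for buffering: a match can span chunk boundaries. It also shows the cost, which is why streaming endpoints are better served by header-only scrubbing.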
The table below is a representative sample. Use DataRedactor.pattern_names for the canonical, machine-readable list — it stays in sync with the C extension automatically.
| # | Pattern | Example |
|---|---|---|
| — | AWS Access Key ID | AKIAIOSFODNN7EXAMPLE |
| — | AWS Secret Access Key | 40-character base64 string |
| — | Google API Key | AIzaSyXXXX... |
| — | GitHub Personal Access Token | github_pat_XXXX... |
| — | GitHub Classic PAT / OAuth | ghp_XXXX... / gho_XXXX... |
| — | Slack Webhook URL | https://hooks.slack.com/services/T.../B.../... |
| — | Stripe Secret Key | sk_live_XXXX... |
| — | Anthropic API Key | sk-ant-api03-XXXX... |
| — | OpenAI Project API Key | sk-proj-XXXX... |
| — | GitLab Personal Access Token | glpat-XXXX... |
| — | DigitalOcean PAT | dop_v1_XXXX... |
| — | Databricks API Token | dapiXXXX... |
| — | Sentry DSN | https://KEY@oNNN.ingest.sentry.io/PID |
| — | PEM Private Key header | -----BEGIN RSA PRIVATE KEY----- |
| — | Scaleway Access Key | SCW12345ABCDE6789FGHIJ |
| — | UUID v4 / Scaleway Secret Key | 550e8400-e29b-41d4-a716-446655440000 |
| # | Pattern | Example |
|---|---|---|
| 2 | Italian Codice Fiscale (basic) | RSSMRA85M01H501Z |
| 3 | Passport — letter prefix + digits | AB1234567 |
| 4 | Passport — 9 consecutive digits ¹ | 123456789 |
| 22 | Italian Codice Fiscale (omocodia) | RSSMRALPMNLH5LMZ |
| # | Pattern | Example |
|---|---|---|
| 11 | Credit card — Visa, Mastercard, Amex, Discover, JCB | 4111111111111111 |
| 12 | IPv4 address | 192.168.1.100 |
| # | Country | Example |
|---|---|---|
| 10 | Italy | IT60X0542811101000000123456 |
| 15 | France | FR7630006000011234567890189 |
| 16 | Germany | DE89370400440532013000 |
| 17 | Spain | ES9121000418450200051332 |
| 18 | Netherlands | NL91ABNA0417164300 |
| 19 | Belgium | BE68539007547034 |
| 20 | Portugal | PT50000201231234567890154 |
| 21 | Ireland | IE29AIBK93115212345678 |
| 28 | Sweden | SE4550000000058398257466 |
| 29 | Denmark | DK5000400440116243 |
| 30 | Norway | NO9386011117947 |
| 31 | Finland | FI2112345600000785 |
| 37 | Poland | PL61109010140000071219812874 |
| 38 | Austria | AT611904300234573201 |
| 39 | Switzerland | CH9300762011623852957 |
| 40 | Czechia | CZ6508000000192000145399 |
| 41 | Hungary | HU42117730161111101800000000 |
| 42 | Romania | RO49AAAA1B31007593840000 |
| # | Country | Type | Example |
|---|---|---|---|
| 23 | France | NIR / Social Security ¹ | 185126203450342 |
| 24 | Spain | DNI ¹ | 12345678Z |
| 25 | Spain | NIE | X1234567L |
| 26 | Netherlands | BSN ¹ | 123456789 |
| 27 | Poland | PESEL ¹ | 85121612345 |
| 32 | Belgium | National Number ¹ | 85121612345 |
| 33 | Sweden | Personnummer ¹ | 850101-1234 |
| 34 | Denmark | CPR Number ¹ | 010185-1234 |
| 35 | Norway | Fødselsnummer ¹ | 01018512345 |
| 36 | Finland | HETU ¹ | 010185-123A |
| 43 | Poland | PESEL (alt slot) ¹ | 90010112345 |
| 44 | Austria | Abgabenkontonummer ¹ | 123456789 |
| 45 | Switzerland | AHV Number ¹ | 756.1234.5678.90 |
| 46 | Czechia | Rodné číslo ¹ | 856121/1234 |
| 47 | Hungary | Tax ID ¹ | 8012345678 |
| 48 | Romania | CNP ¹ | 1850101123456 |
¹ Word-boundary protected: these patterns are wrapped with (^|[^0-9A-Za-z])(PATTERN)([^0-9A-Za-z]|$) at compile time so they do not fire when the digit sequence appears inside a longer alphanumeric token.
redactor/
├── data_redactor.gemspec
├── Gemfile
├── Rakefile
├── lib/
│ ├── data_redactor.rb # Ruby entry point, loads the .so
│ └── data_redactor/
│ ├── version.rb
│ ├── name_pattern.rb # name_pattern helper — generates a name regex for add_pattern
│ └── integrations/ # soft-required Logger / Rails / Rack adapters
├── ext/
│ └── data_redactor/
│ ├── extconf.rb # Checks for C headers, generates Makefile (globs *.c)
│ ├── data_redactor.c # Entry point: Init_data_redactor only
│ ├── patterns.{c,h} # Built-in pattern table + compiled regex_t array
│ ├── placeholder.{c,h} # write_placeholder, djb2 hash, tag_name_for_bit
│ ├── redact.{c,h} # _redact + replace_all_matches + wrap_boundary
│ ├── scan.{c,h} # _scan + byte-offset replacement-log macros
│ ├── custom_patterns.{c,h} # Dynamic registry: add/remove/clear/list
│ └── tags.h # TAG_* bit constants
├── spec/
│ └── data_redactor_spec.rb # RSpec tests — at least one example per pattern, plus filter / placeholder / custom-pattern coverage
├── benchmark/ # Repo-only perf scripts (not packaged in the gem)
│ ├── README.md # How to run, what each script measures
│ ├── support/corpus.rb # Shared payload builders + pure-Ruby baseline redactor
│ ├── throughput.rb # MB/s on representative payloads
│ ├── vs_pure_ruby.rb # C extension vs pure-Ruby gsub (same 88 patterns)
│ ├── scaling.rb # Runtime vs input size 1KB → 50MB
│ └── per_pattern.rb # Per-pattern scan cost
└── docs/ # Design and execution docs for future work
├── standalone_matcher_design.md
└── combined_matcher_plan.md
- Ruby >= 2.7
- A C compiler (`gcc` or `clang`), only required when installing the source gem
- POSIX `regex.h`, only required when installing the source gem (standard on Linux and macOS)
# Gemfile
gem "data_redactor"

bundle install

That's it: there is nothing extra to configure for precompiled binaries. Bundler/RubyGems looks at your platform and Ruby version and picks the right gem automatically.

- On a supported platform (Linux glibc/musl, macOS Intel/ARM): Bundler downloads a precompiled gem with the C extension already built. Install is near-instant, with no compiler, no `make`, and no `regex.h` headers needed. This is especially valuable in slim Docker images (`ruby:3.x-alpine`, `ruby:3.x-slim`) that don't ship `gcc`.
- On any other platform (FreeBSD, OpenBSD, etc.): Bundler downloads the source gem and compiles the C extension on install, the same behavior as before 0.7.1. You'll need a C compiler and POSIX `regex.h` available.
Each precompiled gem ships compiled binaries for Ruby 3.1, 3.2, 3.3, and 3.4.
| Platform | Targets |
|---|---|
| Linux (glibc) | x86_64-linux, aarch64-linux |
| Linux (musl / Alpine) | x86_64-linux-musl, aarch64-linux-musl |
| macOS | x86_64-darwin (Intel), arm64-darwin (Apple Silicon) |
If your Gemfile.lock was generated on one platform but you deploy to another, run bundle lock --add-platform <target> so bundler resolves the right native gem at deploy time. Example for Alpine deploys built from a glibc dev box:
bundle lock --add-platform x86_64-linux-musl aarch64-linux-musl

bundle exec rake compile

This runs extconf.rb via rake-compiler, which generates a Makefile and compiles the C sources in ext/data_redactor/ into a .so shared library placed under lib/data_redactor/.
Maintainers can rebuild the full set of native gems with one command (requires Docker):
bundle exec rake gem:all

This invokes rake-compiler-dock to cross-compile every supported (platform × Ruby ABI) combination. Output lands in pkg/.

bundle exec rake spec

Or compile and test in one step:

bundle exec rake

The benchmark/ directory holds four scripts that measure the C engine from different angles. They are not packaged with the gem.
bundle install # pulls benchmark-ips, benchmark-memory (dev deps)
bundle exec rake compile
bundle exec ruby benchmark/vs_pure_ruby.rb # head-to-head vs pure-Ruby gsub, same 88 patterns
bundle exec ruby benchmark/throughput.rb # MB/s on a log line, JSON, 1MB and 10MB log files
bundle exec ruby benchmark/scaling.rb # runtime vs input size (1KB → 50MB), confirms linear scaling
bundle exec ruby benchmark/per_pattern.rb # per-pattern scan cost over a 1MB payload

See benchmark/README.md for what each script measures and how the pure-Ruby baseline is kept honest (it reads the same patterns the C engine uses, via DataRedactor::BUILTIN_PATTERN_SOURCES).
Recorded so we know where we started when the next round of perf work lands.
| Payload | C extension | Pure-Ruby gsub | C vs Ruby |
|---|---|---|---|
| log line (168 B) | 0.30 ms / call | 0.07 ms / call | 3.4× slower |
| JSON blob (~580 B) | 0.92 ms / call | 0.18 ms / call | 5.0× slower |
| 100 log lines (~17 KB) | 26.5 ms / call | 6.1 ms / call | 4.4× slower |
| 1 MB log | 1.62 s / call | 0.38 s / call | 4.25× slower |
| 10 MB log | ~15 s | ~3.8 s | ~4× slower |
The C extension is currently 3-5× slower than pure-Ruby gsub at every
size measured. The cause is structural — glibc's POSIX regexec lacks
the Boyer-Moore literal pre-filter that Ruby's Onigmo engine has built in —
and is documented in detail under Known limitations.
Two perf fixes have already shipped (a strstr literal pre-filter and
chunked input above 64 KB), which got us 25-30% faster and restored linear
scaling, but the absolute gap remains.
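The literal pre-filter idea is simple enough to show in a few lines of Ruby. This is a sketch of the technique only: the `PREFILTERED` table, the `gitlab_pat` name, both regexes, and `redact_with_prefilter` are assumptions for illustration, with `String#include?` playing the role that `strstr` plays in C.

```ruby
# Hypothetical table: each regex is paired with a literal that every match
# must contain, so a cheap substring test can rule the pattern out early.
PREFILTERED = [
  { name: "aws_access_key_id", literal: "AKIA",   regex: /AKIA[0-9A-Z]{16}/ },
  { name: "gitlab_pat",        literal: "glpat-", regex: /glpat-[0-9A-Za-z_-]{20}/ }
].freeze

# Returns the redacted text plus the names of patterns that were skipped.
def redact_with_prefilter(text, placeholder = "[REDACTED]")
  skipped = []
  redacted = PREFILTERED.reduce(text) do |buffer, pattern|
    if buffer.include?(pattern[:literal])
      buffer.gsub(pattern[:regex], placeholder)
    else
      skipped << pattern[:name]  # the regex engine never runs for this pattern
      buffer
    end
  end
  [redacted, skipped]
end

redact_with_prefilter("key AKIAIOSFODNN7EXAMPLE rotated")
# => ["key [REDACTED] rotated", ["gitlab_pat"]]
```

The saving only applies to patterns that have a mandatory literal; shapes such as pure digit runs have none, so they still pay the full regex cost.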
The long-term plan is a combined multi-pattern matcher (design doc, execution plan) that compiles all 88 patterns into one automaton and walks the input once. That's expected to make the C extension genuinely the fastest option in Ruby; until it ships, use the gem on small payloads where absolute latency is acceptable (< 1 ms for typical log lines).
- At load time, `Init_data_redactor` compiles all 88 regex patterns once using `regcomp` (POSIX ERE) and stores them as static `regex_t` structs. Patterns marked as boundary-wrapped are expanded with `wrap_boundary()` before compilation.
- `DataRedactor.redact(text)` receives a Ruby `String`, converts it to a C `char*` via `StringValueCStr`, and runs each compiled pattern in sequence on a working buffer.
- For each pattern, `replace_all_matches` iterates using `regexec`, copies non-matching segments to a fresh output buffer, and inserts `[REDACTED]` in place of each match. For boundary-wrapped patterns, `regexec` is called with `nmatch=4` and sub-match groups `[1]`/`[3]` identify the boundary characters so they are preserved verbatim.
- The output buffer is grown with `realloc` as needed. After all patterns are applied the result is returned as a Ruby `String` via `rb_str_new_cstr`. All intermediate `malloc`/`strdup` allocations are explicitly `free`d.
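The per-pattern loop described above can be mirrored in Ruby to make the control flow easier to follow. This is a sketch that borrows the name `replace_all_matches` for readability; it is not the C code, the two regexes are simplified stand-ins for built-in patterns, and boundary-group handling is omitted.

```ruby
# One pass for one pattern: copy unmatched segments to a fresh buffer and
# emit the placeholder in place of each match.
def replace_all_matches(input, regex, placeholder = "[REDACTED]")
  out = +""
  pos = 0
  while (m = regex.match(input, pos))
    break if m[0].empty?                  # avoid looping on zero-width matches
    out << input[pos...m.begin(0)] << placeholder
    pos = m.end(0)
  end
  out << input[pos..]                     # tail after the last match
end

# Patterns run in sequence; each one sees the previous pattern's output.
def redact_sequential(text, regexes)
  regexes.reduce(text) { |buffer, regex| replace_all_matches(buffer, regex) }
end

redact_sequential(
  "User AKIAIOSFODNN7EXAMPLE logged in from 192.168.1.1",
  [/AKIA[0-9A-Z]{16}/, /[0-9]{1,3}(\.[0-9]{1,3}){3}/]
)
# => "User [REDACTED] logged in from [REDACTED]"
```

Each pattern costs one full pass over the working buffer, which is the per-pattern overhead the performance notes refer to.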
All C-side buffers are heap-allocated with malloc/strdup and freed before the function returns. The only Ruby-managed allocation is the final return value from rb_str_new_cstr. No Ruby objects are created mid-processing, so GC cannot collect anything out from under the C code.
DataRedactor.redact and DataRedactor.scan are safe to call concurrently from multiple threads. Built-in patterns are compiled into a static regex_t array at load time and never mutated afterward, and each call allocates its own working buffers. POSIX regexec is documented as thread-safe.
DataRedactor.add_pattern, remove_pattern, and clear_custom_patterns! mutate a shared dynamic array and are not thread-safe. Register custom patterns once at boot — before spawning worker threads or forking — and they will be visible (read-only) to every subsequent redact/scan call.
This project follows Semantic Versioning 2.0.0. Until 1.0.0, minor versions may introduce breaking changes; from 1.0.0 onward, breaking changes will only land in major versions. See CHANGELOG.md for the release history.
Released under the MIT License.
- Pattern ordering matters — patterns run sequentially. An early broad pattern (e.g. the 9-digit passport) may consume digits that a later pattern (e.g. credit card) depends on. Boundary wrapping mitigates this for pure-digit patterns.
- AWS Secret Key (pattern 1) — 40 consecutive base64 characters is a broad match. It can produce false positives in base64-encoded content such as embedded images or binary blobs.
- Duplicate digit patterns — several national ID formats share the same digit-length (11 digits: PESEL, Norwegian Fødselsnummer, Belgian National Number). They are kept as separate slots for clarity but the practical effect is that any 11-digit boundary-delimited number will be redacted.
- Performance is currently slower than pure-Ruby `gsub`. A May 2026 investigation found the C extension is 3–5× slower than a pure-Ruby `gsub` loop running the same 88 patterns, across input sizes from 168 bytes to 1 MB. The root cause is glibc's POSIX `regexec()`: each call allocates an O(input-length) state buffer before any matching begins, and the gem calls it once per pattern in sequence. Ruby's Onigmo engine wins by using a built-in Boyer-Moore literal pre-filter that this gem can only approximate. Several perf fixes have shipped (buffer sizing in `replace_all_matches`, a `strstr` literal pre-filter, and input chunking for large payloads), which gave ~25-30% improvement and made scaling linear, but the absolute gap remains. Use the gem on small payloads where the absolute latency is still acceptable (< 1 ms for typical log lines); for high-throughput pipelines, hold off until the next major release. See `docs/standalone_matcher_design.md` for the long-term plan.
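The pattern-ordering limitation at the top of this list is easy to reproduce with two toy patterns. The sketch below uses Ruby's regex engine and simplified sources (`NINE_DIGITS` and `CARD` are illustrative, not the gem's built-ins) to show how an unbounded early pattern can break a later one, and how boundary wrapping avoids it.

```ruby
NINE_DIGITS = "[0-9]{9}"     # stand-in for the 9-digit passport shape
CARD        = "4[0-9]{15}"   # simplified Visa-style shape, for illustration

def redact_plain(text, source)
  text.gsub(Regexp.new(source), "[REDACTED]")
end

def redact_bounded(text, source)
  wrapped = Regexp.new("(^|[^0-9A-Za-z])(#{source})([^0-9A-Za-z]|$)")
  text.gsub(wrapped) { "#{$1}[REDACTED]#{$3}" }
end

text = "card 4111111111111111 on file"

# Unbounded: the 9-digit pattern runs first and eats part of the card number,
# so the card pattern no longer finds 16 digits.
plain_first = redact_plain(text, NINE_DIGITS)    # => "card [REDACTED]1111111 on file"
redact_plain(plain_first, CARD)                  # => "card [REDACTED]1111111 on file"

# Bounded: the 9-digit pattern ignores the longer run, so the card pattern fires.
bounded_first = redact_bounded(text, NINE_DIGITS)  # => "card 4111111111111111 on file"
redact_plain(bounded_first, CARD)                  # => "card [REDACTED] on file"
```

In the unbounded case seven digits of the card number survive in the output, which is the failure mode the boundary wrapper is meant to prevent.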