A Query.Farm VGI worker for DuckDB.
vgi-pii · a Query.Farm VGI worker · powered by Microsoft Presidio
A VGI worker that brings PII detection and redaction into DuckDB/SQL. It finds and removes personally-identifiable information — names, emails, phone numbers, credit cards, SSNs, locations, and more — from free text, as plain SQL functions, backed by Microsoft Presidio (analyzer + anonymizer) and a pinned spaCy model.
INSTALL vgi FROM community; LOAD vgi;
ATTACH 'pii' (TYPE vgi, LOCATION 'uv run pii_worker.py');
SELECT pii.has_pii('Call John Smith at john@example.com'); -- true
SELECT pii.redact('Call John Smith at john@example.com'); -- 'Call <PERSON> at <EMAIL_ADDRESS>'
SELECT pii.anonymize('Call John Smith at john@example.com'); -- 'Call **** at ****************'
SELECT pii.pii_types('Call John Smith at john@example.com'); -- ['EMAIL_ADDRESS', 'PERSON']
SELECT * FROM pii.detect_pii('Call John Smith at john@example.com');
SELECT * FROM pii.supported_entities() ORDER BY entity_type;The Presidio analyzer is wired to the en_core_web_sm spaCy pipeline
(~12 MB, MIT), pinned as a wheel dependency. Everything runs offline — no
network calls at query time — and the model loads once per worker process
and is amortised over every row of every query.
The split follows what the VGI SDK allows for each function shape:
-
Scalars take positional arguments only and resolve overloads by arity (DuckDB's
name := valuesyntax is a table-function/macro feature, not a scalar one). Every per-row answer is a scalar, so it works inline in any projection or predicate. The optionallanguageargument is an extra positional arity overload:SELECT has_pii(body) FROM messages; -- language defaults to 'en' SELECT has_pii(body, 'en') FROM messages; -- explicit language SELECT id, redact(body) FROM messages; SELECT body, anonymize(body) FROM messages; SELECT pii_types(body) FROM messages; -- VARCHAR[]
-
Table functions return many rows and do take
name := valueargs:detect_pii(text, language := 'en', score_threshold := 0.5)(one row per detected entity) andsupported_entities(language := 'en')(discovery).SELECT * FROM pii.detect_pii('Email john@example.com', score_threshold := 0.8); SELECT * FROM pii.supported_entities() ORDER BY entity_type;
NULL semantics. A NULL (or empty / whitespace-only) input yields NULL output
for every scalar, and no rows for detect_pii. Text with no detected PII is
returned unchanged by redact / anonymize. Nothing raises on odd input —
detection of a problematic row degrades to "no PII" rather than crashing the
worker.
| Function | Form | Signature | Returns |
|---|---|---|---|
has_pii |
scalar | (text[, language]) |
BOOLEAN (NULL if no text) |
redact |
scalar | (text[, language]) |
VARCHAR — entities → <TYPE> tags |
anonymize |
scalar | (text[, language]) |
VARCHAR — entities → * mask |
pii_types |
scalar | (text[, language]) |
VARCHAR[] — distinct entity types, sorted |
detect_pii |
table | (text, language := 'en', score_threshold := 0.5) |
(entity_type VARCHAR, text VARCHAR, start INT, end_pos INT, score DOUBLE) |
supported_entities |
table | (language := 'en') |
(entity_type VARCHAR) |
The language default is 'en'. detect_pii.score_threshold (and the
threshold used by all scalars) defaults to 0.5 — detections below that
confidence are dropped.
Presidio combines a spaCy NER model with a battery of pattern/context
recognizers (email, phone, credit card, IBAN, SSN, URL, IP, crypto, dates, …).
detect_pii returns one row per detected entity with its character offsets
(start inclusive, end_pos exclusive — end_pos is suffixed because end is
a SQL keyword) and confidence score. Use supported_entities() to discover
every type the analyzer can return.
redact and anonymize are two distinct anonymizer operators over the same
detections:
redactuses Presidio's default replace operator, swapping each entity for a<ENTITY_TYPE>tag (<PERSON>,<EMAIL_ADDRESS>, …) — readable, preserves what kind of value was there.anonymizeuses the mask operator, overwriting every character of each entity with*— preserves nothing about the value.
SELECT id,
redact(body) AS tagged,
anonymize(body) AS masked
FROM messages
WHERE has_pii(body);This worker is MIT. Native / model dependencies:
| Component | License | Notes |
|---|---|---|
vgi-pii (this worker) |
MIT | This repository's own code. |
presidio-analyzer |
MIT | PII detection. |
presidio-anonymizer |
MIT | Redaction / masking operators. |
spaCy |
MIT | NLP runtime. |
en_core_web_sm |
MIT | The pinned English spaCy pipeline (model weights). |
vgi-python |
Query Farm Source-Available | The VGI SDK. |
spaCy model licensing. The pinned
en_core_web_smmodel is MIT. Other spaCy models vary — many_sm/_lgEnglish models are MIT, while some non-English pipelines are CC-BY-SA (their training corpora carry that license). If you swap the pinned model (seeengine.SPACY_MODEL), check that model's license for your use.
Detection is only as complete as Presidio's recognizers + the spaCy model; consult the Presidio docs for the supported-entity matrix and accuracy caveats. PII detection is probabilistic — treat it as a strong filter, not a guarantee of complete removal.
uv sync --all-extras # .venv with vgi-python + presidio + spaCy + en_core_web_sm + dev tools
make test # pytest (unit + integration) + SQL end-to-end
make test-unit # pytest only
make test-sql # DuckDB sqllogictest files via haybarn-unittest
uv run ruff check . # lint
uv run mypy vgi_pii/tests/test_engine.py covers the pure Presidio logic (multiple PII types, none,
empty, NULL, unicode, very long text, threshold behaviour); tests/test_tables.py
drives detect_pii / supported_entities through the real bind→init→process
lifecycle in-process; tests/test_scalars.py spawns pii_worker.py over the VGI
client/RPC stack exactly as DuckDB would after ATTACH. The test/sql/*.test
files are DuckDB sqllogictest cases run by
haybarn-unittest
(uv tool install haybarn-unittest) against a real ATTACH + SELECT.
pii_worker.py entry point; assembles the `pii` catalog (inline uv script metadata); warms Presidio at startup
Makefile test / test-unit / test-sql / lint targets
vgi_pii/
engine.py pure Presidio lifecycle + logic (analyzer/anonymizer cached per-process; no Arrow/VGI)
scalars.py per-row scalars (arity overloads for the optional `language`)
tables.py table functions: detect_pii, supported_entities
schema_utils.py Arrow field/comment helper
tests/
harness.py in-process bind→init→process driver
test_engine.py pure-logic unit + edge tests
test_tables.py table-function integration tests
test_scalars.py per-row scalar overloads via vgi.client.Client
test/sql/
*.test DuckDB sqllogictest end-to-end cases (haybarn-unittest)
Written by Query.Farm.
Copyright 2026 Query Farm LLC - https://query.farm
