A Databricks App that scans Unity Catalog metadata across all accessible catalogs to find duplicate datasets, recommends gold-standard tables, and surfaces group-level access permissions — helping data architects and engineers clean up data sprawl before it reaches analysts.
| Feature | Description |
|---|---|
| Multi-Catalog Scanner | Background scan of every accessible catalog with live progress polling — schemas, tables, columns, types, comments, timestamps, and permissions. Per-catalog breakdowns shown on the dashboard. |
| Result Caching | Scan results and duplicate groups are cached in catalog_40_copper_uc_metadata.cache. On startup the app loads from cache instantly; a fresh scan is triggered automatically if the cache is older than 7 days or the cache schema version has changed. |
| Permissions Viewer | Shows which groups and users have READ / WRITE access to each table via metadata snapshot tables — no MANAGE privilege needed. Permissions are merged across catalog, schema, and table grant levels. |
| Duplicate Detection | Clusters tables that represent the same entity using four similarity signals: column-name Jaccard, type compatibility, fuzzy table-name matching, and Unity Catalog table lineage (shared upstream sources / direct data flow). Uses a normalised-name grouping pre-filter for scalable comparison (~100K candidates from 143K tables). Groups are labelled by common entity name. |
| Lineage Integration | Loads the last 90 days of system.access.table_lineage (via snapshot) to identify tables that share upstream sources or directly feed each other. Enhances duplicate scoring, pipeline-stage tagging, and gold-standard ranking. |
| Gold Standard Scoring | Ranks each table in a duplicate group on column completeness, freshness, downstream consumer count, and upstream position to recommend the canonical dataset. |
| Table Comparison | Side-by-side column diff and permissions diff for any two tables. |
┌─────────────────────────────────────────────────────────────┐
│ Databricks App │
│ │
│ ┌──────────────┐ ┌──────────────────────────────────┐ │
│ │ Frontend │ │ FastAPI Backend │ │
│ │ (Vanilla JS) │◄────►│ │ │
│ │ │ │ /api/catalog/* (scan, cache) │ │
│ │ Dashboard │ │ /api/duplicates/* (groups) │ │
│ │ Catalog │ │ /api/compare/* (diff, sample) │ │
│ │ Duplicates │ │ │ │
│ │ Compare │ │ scanner.py (SQL Statement API) │ │
│ └──────────────┘ │ cache.py (UC cache layer) │ │
│ │ duplicates.py (detection engine) │ │
│ │ comparator.py (table diff) │ │
│ └────────┬─────────────────────────┘ │
└──────────────────────────────────┼───────────────────────────┘
│
┌──────────────▼──────────────────┐
│ Unity Catalog │
│ │
│ catalog_40_copper_uc_metadata │
│ ├── metadata.* (weekly CTAS │
│ │ snapshots of │
│ │ system.information_schema) │
│ └── cache.* (scan results │
│ + duplicate groups) │
│ │
│ SQL Statement API │
│ (EXTERNAL_LINKS disposition) │
└───────────────────────────────────┘
uc-duplicate-data-detector/
├── databricks.yml # DAB bundle config (targets, variables)
├── app.yaml # App runtime config (command, env vars)
├── app.py # FastAPI entrypoint
├── requirements.txt # Python dependencies
├── server/
│ ├── config.py # Dual-mode auth (local CLI / deployed App)
│ ├── scanner.py # UC metadata scanner + permissions (via SQL)
│ ├── cache.py # Cache manager (read/write scan results to UC)
│ ├── duplicates.py # Duplicate detection + gold standard scoring
│ ├── comparator.py # Table comparison + sample data
│ └── routes/
│ ├── catalog.py # /api/catalog/* (list, scan, cache, schemas, tables)
│ ├── duplicates.py # /api/duplicates/*
│ └── compare.py # /api/compare/*
├── frontend/
│ └── dist/ # Static SPA (HTML/CSS/JS, no build step)
└── scripts/
├── deploy.sh # One-command deploy (bundle + app source)
├── create_governance_tables.sql # CTAS snapshots of system.information_schema (run weekly)
├── ingest_group_membership # Notebook: fetches SCIM group memberships into UC
├── generate_data.py # Test data generator (Python + CLI)
└── generate_data.sql # Test data generator (pure SQL)
All metadata is read from snapshot tables in catalog_40_copper_uc_metadata.metadata — weekly CTAS copies of system.information_schema tables (catalogs, schemata, tables, columns, catalog_privileges, schema_privileges, table_privileges) plus system.access.table_lineage and system.access.column_lineage. These snapshots provide definer-rights access so the app's service principal can read metadata across all catalogs without MANAGE privileges.
- POST /api/catalog/scan-all launches a background thread that queries each catalog sequentially (schemas, tables, columns, and three privilege levels).
- The frontend polls GET /api/catalog/scan-status every 2 seconds with progress updates ("Scanning catalog_name (3/7)…").
- Table lineage is loaded from the
metadata.table_lineagesnapshot (last 90 days): distinct source→target edges and per-table consumer counts. - Duplicate detection runs in the same background thread using a normalised-name pre-filter and four-component composite similarity scoring (columns, types, names, lineage).
- Results are written to the UC cache (
catalog_40_copper_uc_metadata.cache) so the next app restart loads instantly. - The SQL Statement API uses
EXTERNAL_LINKSdisposition to handle result sets exceeding the 25 MB inline limit.
Scan results and duplicate groups are persisted in two Delta tables with append-based versioning (scan_id):
| Table | Contents |
|---|---|
cache.cache_metadata |
One row per scan: scan_id, cache version, timestamp, serialised scan-result JSON |
cache.duplicate_groups |
One row per group per scan: scan_id, label, tables, pairs, gold standard, scores, tags (complex fields as JSON) |
Each new scan appends a new scan_id — previous scans are retained as auditable history. The app always loads from the latest scan_id.
Invalidation triggers:
- Cache older than 7 days (matching the weekly metadata refresh cadence)
CACHE_VERSIONconstant incache.pydiffers from the stored version (covers schema changes during development)
On startup the frontend calls GET /api/catalog/cache-status. If valid, POST /api/catalog/cache-load bulk-loads tables and schemas from the metadata snapshot (2 queries total, vs 6 × N catalogs for a full scan) and restores duplicate groups from cache. Table columns and permissions are loaded lazily on demand when viewing individual table details.
Groups are automatically tagged during detection:
| Tag | Detection method | Default |
|---|---|---|
governance_view |
2-table group where a VIEW's columns are a subset of the paired TABLE | Hidden |
pipeline_stage |
Lineage-first: any table in the group directly feeds another member (confirmed via table_lineage). Fallback: all tables in medallion catalogs (gold/silver/bronze) spanning 2+ tiers |
Hidden |
shared_source |
All tables in the group share at least one common upstream source table (requires lineage data for every member) | Visible (informational) |
Frontend filters (Duplicates page): two checkboxes for tag-based hiding + a free-text catalog prefix filter. Pagination limits rendering to 50 groups at a time.
- Databricks CLI >= 0.230
- Python 3.10+
- A Databricks workspace with:
- Unity Catalog enabled
- A SQL warehouse (Serverless or Pro)
- The
catalog_40_copper_uc_metadatacatalog with ametadataschema (weekly CTAS snapshots) - Permission to create Apps
The app's service principal needs no MANAGE privilege. Permissions are read from weekly CTAS snapshot tables in catalog_40_copper_uc_metadata.metadata — mirrors of system.information_schema that provide definer-rights access.
| Privilege | Scope | Purpose |
|---|---|---|
USE CATALOG |
catalog_40_copper_uc_metadata |
Access the app catalog |
USE SCHEMA |
catalog_40_copper_uc_metadata.metadata |
Read metadata snapshots |
SELECT |
catalog_40_copper_uc_metadata.metadata |
Query schemas, tables, columns, privileges |
CREATE SCHEMA |
catalog_40_copper_uc_metadata |
Create the cache schema on first scan |
CAN_USE |
SQL warehouse | Execute SQL queries |
Step 1 — Create metadata snapshot tables (schedule weekly):
Open scripts/create_governance_tables.sql in a Databricks SQL editor and run all statements. This creates CTAS snapshot tables in catalog_40_copper_uc_metadata.metadata that mirror system.information_schema views. The executing user becomes the schema owner (definer) — their privileges determine what metadata is visible at refresh time.
Step 2 — Grant the SP access (replace <SP_ID>):
-- Metadata snapshot access
GRANT USE CATALOG ON CATALOG catalog_40_copper_uc_metadata TO `<SP_ID>`;
GRANT USE SCHEMA ON SCHEMA catalog_40_copper_uc_metadata.metadata TO `<SP_ID>`;
GRANT SELECT ON SCHEMA catalog_40_copper_uc_metadata.metadata TO `<SP_ID>`;
-- Cache schema (created automatically, but SP needs permission to create it)
GRANT CREATE SCHEMA ON CATALOG catalog_40_copper_uc_metadata TO `<SP_ID>`;git clone https://github.com/richardli29/uc-data-quality-duplicates.git && cd uc-data-quality-duplicates
databricks auth login --host https://<WORKSPACE>.cloud.databricks.comdatabricks warehouses list --output json | python3 -c "
import json, sys
for w in json.load(sys.stdin):
print(f'{w[\"id\"]} {w[\"name\"]} ({w[\"state\"]})')
"Edit two files with your workspace details:
app.yaml — set the SQL warehouse for runtime queries:
env:
- name: WAREHOUSE_ID
value: "abc123def456"databricks.yml — set the workspace and warehouse for deployment:
targets:
dev:
workspace:
host: https://<WORKSPACE>.cloud.databricks.com
variables:
warehouse_id: abc123def456Note:
CATALOG_NAMEis not required. The app discovers and scans all accessible catalogs automatically.
Creates 20 education-themed tables across 5 schemas with deliberate duplicates.
Option A — SQL notebook (paste into a Databricks SQL editor):
# Open scripts/generate_data.sql and set the widget to your catalog nameOption B — Python CLI:
python3 scripts/generate_data.py \
--catalog my_catalog \
--warehouse abc123def456| Schema | Tables | Purpose |
|---|---|---|
bronze |
raw_students, raw_schools, raw_exam_results, raw_attendance |
Raw ingestion layer with original column names |
silver |
students, schools, exam_results, attendance |
Cleaned and standardised |
gold |
dim_students, dim_schools, fact_exam_results, fact_attendance_agg |
Curated with SCD tracking, aggregations, and documentation |
team_analytics |
student_data, school_info, exam_scores, student_attendance |
Duplicate set with renamed columns |
team_reporting |
pupils, school_directory, assessment_results, attendance_register |
Another duplicate set with different naming |
./scripts/deploy.shOr step by step:
databricks bundle validate
databricks bundle deploy
databricks apps start uc-data-duplicates
databricks apps deploy uc-data-duplicates \
--source-code-path /Workspace/Users/<you>/.bundle/uc-data-duplicates/dev/filesFind the SP ID:
databricks apps get uc-data-duplicates --output json | python3 -c "
import json, sys
d = json.load(sys.stdin)
print(d['service_principal_client_id'])
"Then grant access (replace <SP_ID>):
-- Metadata snapshots
GRANT USE CATALOG ON CATALOG catalog_40_copper_uc_metadata TO `<SP_ID>`;
GRANT USE SCHEMA ON SCHEMA catalog_40_copper_uc_metadata.metadata TO `<SP_ID>`;
GRANT SELECT ON SCHEMA catalog_40_copper_uc_metadata.metadata TO `<SP_ID>`;
-- Cache schema creation
GRANT CREATE SCHEMA ON CATALOG catalog_40_copper_uc_metadata TO `<SP_ID>`;And grant CAN_USE on the SQL warehouse:
databricks api patch /api/2.0/permissions/sql/warehouses/<WAREHOUSE_ID> \
--json '{"access_control_list":[{"service_principal_name":"<SP_ID>","permission_level":"CAN_USE"}]}'databricks apps get uc-data-duplicatesOpen the url from the output in your browser. You must be logged into the workspace first.
- Dashboard — on first load, the app checks for a valid cache. If found, results load instantly with a green "Loaded from cache" banner. Otherwise, click Scan All Catalogs to start a background scan with live progress. Stat cards show total schemas, tables, columns, duplicate groups, and per-catalog breakdowns.
- Catalog Explorer — browse all catalogs in a tree (catalog > schema > table). Click a table to see columns, owner, comments, and merged permissions (loaded on demand).
- Duplicates — view duplicate clusters labelled by entity name (e.g. "Students", "Exam Results") with similarity scores and gold-standard badges. Adjust the threshold slider and re-detect. Filter by catalog prefix, hide governance views and medallion pipeline stages. Cross-catalog duplicates are detected too.
- Compare — pick any two tables (from any catalog) for a side-by-side column diff and permissions comparison.
| Method | Path | Description |
|---|---|---|
GET |
/api/catalog/list |
List all accessible catalogs |
POST |
/api/catalog/scan-all |
Start a background scan (returns immediately) |
GET |
/api/catalog/scan-status |
Poll current scan progress |
GET |
/api/catalog/cache-status |
Check whether a valid cache exists |
POST |
/api/catalog/cache-load |
Load scan results from cache + bulk metadata |
GET |
/api/catalog/schemas?catalog=X |
List scanned schemas (optional catalog filter) |
GET |
/api/catalog/tables?schema=gold&catalog=X |
List tables (lightweight summaries, no columns) |
GET |
/api/catalog/table/{catalog}/{schema}/{table} |
Full metadata for one table (columns + permissions loaded on demand) |
POST |
/api/duplicates/detect?threshold=0.5 |
Start background re-detection with a custom threshold |
GET |
/api/duplicates/detect-status |
Poll detection progress |
GET |
/api/duplicates/groups |
Return pre-computed duplicate groups |
GET |
/api/compare/{cat1}/{s1}/{t1}/{cat2}/{s2}/{t2} |
Column + permissions diff between two tables |
Edit server/duplicates.py — the detect_duplicates function accepts:
col_weight(default 0.40) — Jaccard index on canonical column namestype_weight(default 0.25) — proportion of shared columns with compatible typesname_weight(default 0.15) — token-based Jaccard on table nameslin_weight(default 0.20) — lineage similarity (1.0 for direct A→B flow, otherwise Jaccard on upstream source tables)
Per-pair normalisation: If neither table in a pair has any lineage data, the lineage weight is redistributed proportionally to the other three components — so those pairs are scored identically to the pre-lineage algorithm. Only pairs where at least one table has lineage entries are scored with the 4-component weights.
Each table in a duplicate group is scored on (max 40 pts):
- Column completeness (10 pts) — tables with more columns score higher
- Freshness (10 pts) — recently updated tables score higher
- Consumer count (10 pts) — tables read by more downstream entities (notebooks, jobs, dashboards, queries) score higher. Based on distinct
entity_idcounts fromtable_lineage. This is the "social proof" signal — the table the organisation actually uses most. - Upstream position (10 pts) — if this table is an upstream source of other tables in the same group, it scores the full 10. Copies and derived tables score 0.
The consumer count and upstream position factors are only applied when lineage data is available. Without lineage, scoring reverts to the original 20-point scale (completeness + freshness).
Edit the _SYNONYMS dictionary in server/duplicates.py to add domain-specific column name mappings (e.g. student_id = learner_id = pupil_id). This improves duplicate detection when teams use different naming conventions.
Edit server/cache.py to change:
CACHE_MAX_AGE_DAYS(default 7) — how long before the cache expiresCACHE_VERSION(default "3") — bump this when changing cache table schemas to force a rebuild
-
Productionise group membership ingestion — The
scripts/ingest_group_membershipnotebook fetches user and group membership data from the workspace SCIM API intocatalog_40_copper_uc_metadata.metadata.group_membership. It currently runs under interactive user credentials requiring Workspace Admin privileges.Target setup:
- Create a dedicated service principal solely for this purpose (not the app's SP)
- Grant it Workspace Admin role — this is the minimum role that can list all users and all group members via the SCIM API. There is no read-only SCIM permission available in Databricks; Workspace Admin is the floor.
- Schedule as a weekly job alongside the metadata CTAS refresh
- The app reads from the
group_membershiptable at scan time — no admin privilege needed at app runtime
Permission model context (Databricks SCIM API):
Role List users List group members Can modify Account admin Yes Yes Everything Workspace admin Yes Yes Add/remove users from workspace, manage workspace-local groups Group manager (Preview) No Only managed groups Membership of managed groups only Regular user No No No Mitigations:
- Dedicated SP limits blast radius (no other use)
- Workspace admin cannot modify account-level groups, delete users from the account, access other workspaces, or change Unity Catalog permissions
- Running as a scheduled job (not app runtime) means the admin credential is never exposed to the app or its users
-
Compare page lineage section — When viewing two tables side-by-side, add a "Lineage" panel showing:
- Whether one feeds the other (and via what entity type — notebook, job, pipeline)
- Their shared upstream sources
- Consumer count for each (who reads from them)
- Column-level mappings from
column_lineageshowing which source columns map to which target columns
-
Integrate
column_lineageinto duplicate detection — Currently onlytable_lineageis used.column_lineage(source→target column mappings) could:- Confirm that two tables' matching columns actually originate from the same source column (stronger signal than name-matching alone)
- Improve the
governance_viewtagger by verifying the view's columns trace back to the paired table - Provide column-level provenance on the Compare page (answering "are these the same data with renamed columns?")
-
Add
shared_sourcefilter checkbox — Theshared_sourcetag is assigned but has no dedicated hide checkbox in the frontend filter controls. Consider adding one alongside the governance view and pipeline stage filters.
MIT