UC Data Quality Duplicates

A Databricks App that scans Unity Catalog metadata across all accessible catalogs to find duplicate datasets, recommends gold-standard tables, and surfaces group-level access permissions — helping data architects and engineers clean up data sprawl before it reaches analysts.

Features

Feature	Description
Multi-Catalog Scanner	Background scan of every accessible catalog with live progress polling — schemas, tables, columns, types, comments, timestamps, and permissions. Per-catalog breakdowns shown on the dashboard.
Result Caching	Scan results and duplicate groups are cached in `catalog_40_copper_uc_metadata.cache`. On startup the app loads from cache instantly; a fresh scan is triggered automatically if the cache is older than 7 days or the cache schema version has changed.
Permissions Viewer	Shows which groups and users have READ / WRITE access to each table via metadata snapshot tables — no `MANAGE` privilege needed. Permissions are merged across catalog, schema, and table grant levels.
Duplicate Detection	Clusters tables that represent the same entity using four similarity signals: column-name Jaccard, type compatibility, fuzzy table-name matching, and Unity Catalog table lineage (shared upstream sources / direct data flow). Uses a normalised-name grouping pre-filter for scalable comparison (~100K candidates from 143K tables). Groups are labelled by common entity name.
Lineage Integration	Loads the last 90 days of `system.access.table_lineage` (via snapshot) to identify tables that share upstream sources or directly feed each other. Enhances duplicate scoring, pipeline-stage tagging, and gold-standard ranking.
Gold Standard Scoring	Ranks each table in a duplicate group on column completeness, freshness, downstream consumer count, and upstream position to recommend the canonical dataset.
Table Comparison	Side-by-side column diff and permissions diff for any two tables.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                      Databricks App                          │
│                                                              │
│  ┌──────────────┐       ┌──────────────────────────────────┐ │
│  │   Frontend    │       │       FastAPI Backend             │ │
│  │  (Vanilla JS) │◄────►│                                  │ │
│  │              │       │  /api/catalog/*   (scan, cache)   │ │
│  │  Dashboard   │       │  /api/duplicates/* (groups)       │ │
│  │  Catalog     │       │  /api/compare/*   (diff, sample)  │ │
│  │  Duplicates  │       │                                  │ │
│  │  Compare     │       │  scanner.py   (SQL Statement API) │ │
│  └──────────────┘       │  cache.py     (UC cache layer)   │ │
│                         │  duplicates.py (detection engine) │ │
│                         │  comparator.py (table diff)      │ │
│                         └────────┬─────────────────────────┘ │
└──────────────────────────────────┼───────────────────────────┘
                                   │
                    ┌──────────────▼──────────────────┐
                    │       Unity Catalog               │
                    │                                   │
                    │  catalog_40_copper_uc_metadata     │
                    │   ├── metadata.*  (weekly CTAS     │
                    │   │   snapshots of                 │
                    │   │   system.information_schema)   │
                    │   └── cache.*    (scan results     │
                    │       + duplicate groups)          │
                    │                                   │
                    │  SQL Statement API                 │
                    │   (EXTERNAL_LINKS disposition)     │
                    └───────────────────────────────────┘

Project structure

uc-duplicate-data-detector/
├── databricks.yml              # DAB bundle config (targets, variables)
├── app.yaml                    # App runtime config (command, env vars)
├── app.py                      # FastAPI entrypoint
├── requirements.txt            # Python dependencies
├── server/
│   ├── config.py               # Dual-mode auth (local CLI / deployed App)
│   ├── scanner.py              # UC metadata scanner + permissions (via SQL)
│   ├── cache.py                # Cache manager (read/write scan results to UC)
│   ├── duplicates.py           # Duplicate detection + gold standard scoring
│   ├── comparator.py           # Table comparison + sample data
│   └── routes/
│       ├── catalog.py          # /api/catalog/*  (list, scan, cache, schemas, tables)
│       ├── duplicates.py       # /api/duplicates/*
│       └── compare.py          # /api/compare/*
├── frontend/
│   └── dist/                   # Static SPA (HTML/CSS/JS, no build step)
└── scripts/
    ├── deploy.sh                      # One-command deploy (bundle + app source)
    ├── create_governance_tables.sql   # CTAS snapshots of system.information_schema (run weekly)
    ├── ingest_group_membership        # Notebook: fetches SCIM group memberships into UC
    ├── generate_data.py               # Test data generator (Python + CLI)
    └── generate_data.sql              # Test data generator (pure SQL)

How it works

Metadata source

All metadata is read from snapshot tables in catalog_40_copper_uc_metadata.metadata — weekly CTAS copies of system.information_schema tables (catalogs, schemata, tables, columns, catalog_privileges, schema_privileges, table_privileges) plus system.access.table_lineage and system.access.column_lineage. These snapshots provide definer-rights access so the app's service principal can read metadata across all catalogs without MANAGE privileges.

Scan flow

POST /api/catalog/scan-all launches a background thread that queries each catalog sequentially (schemas, tables, columns, and three privilege levels).
The frontend polls GET /api/catalog/scan-status every 2 seconds with progress updates ("Scanning catalog_name (3/7)…").
Table lineage is loaded from the metadata.table_lineage snapshot (last 90 days): distinct source→target edges and per-table consumer counts.
Duplicate detection runs in the same background thread using a normalised-name pre-filter and four-component composite similarity scoring (columns, types, names, lineage).
Results are written to the UC cache (catalog_40_copper_uc_metadata.cache) so the next app restart loads instantly.
The SQL Statement API uses EXTERNAL_LINKS disposition to handle result sets exceeding the 25 MB inline limit.

Cache layer

Scan results and duplicate groups are persisted in two Delta tables with append-based versioning (scan_id):

Table	Contents
`cache.cache_metadata`	One row per scan: scan_id, cache version, timestamp, serialised scan-result JSON
`cache.duplicate_groups`	One row per group per scan: scan_id, label, tables, pairs, gold standard, scores, tags (complex fields as JSON)

Each new scan appends a new scan_id — previous scans are retained as auditable history. The app always loads from the latest scan_id.

Invalidation triggers:

Cache older than 7 days (matching the weekly metadata refresh cadence)
CACHE_VERSION constant in cache.py differs from the stored version (covers schema changes during development)

On startup the frontend calls GET /api/catalog/cache-status. If valid, POST /api/catalog/cache-load bulk-loads tables and schemas from the metadata snapshot (2 queries total, vs 6 × N catalogs for a full scan) and restores duplicate groups from cache. Table columns and permissions are loaded lazily on demand when viewing individual table details.

Duplicate group filtering

Groups are automatically tagged during detection:

Tag	Detection method	Default
`governance_view`	2-table group where a VIEW's columns are a subset of the paired TABLE	Hidden
`pipeline_stage`	Lineage-first: any table in the group directly feeds another member (confirmed via `table_lineage`). Fallback: all tables in medallion catalogs (gold/silver/bronze) spanning 2+ tiers	Hidden
`shared_source`	All tables in the group share at least one common upstream source table (requires lineage data for every member)	Visible (informational)

Frontend filters (Duplicates page): two checkboxes for tag-based hiding + a free-text catalog prefix filter. Pagination limits rendering to 50 groups at a time.

Prerequisites

Databricks CLI >= 0.230
Python 3.10+
A Databricks workspace with:
- Unity Catalog enabled
- A SQL warehouse (Serverless or Pro)
- The catalog_40_copper_uc_metadata catalog with a metadata schema (weekly CTAS snapshots)
- Permission to create Apps

Permissions

The app's service principal needs no MANAGE privilege. Permissions are read from weekly CTAS snapshot tables in catalog_40_copper_uc_metadata.metadata — mirrors of system.information_schema that provide definer-rights access.

Required grants for the service principal

Privilege	Scope	Purpose
`USE CATALOG`	`catalog_40_copper_uc_metadata`	Access the app catalog
`USE SCHEMA`	`catalog_40_copper_uc_metadata.metadata`	Read metadata snapshots
`SELECT`	`catalog_40_copper_uc_metadata.metadata`	Query schemas, tables, columns, privileges
`CREATE SCHEMA`	`catalog_40_copper_uc_metadata`	Create the `cache` schema on first scan
`CAN_USE`	SQL warehouse	Execute SQL queries

Quick setup

Step 1 — Create metadata snapshot tables (schedule weekly):

Open scripts/create_governance_tables.sql in a Databricks SQL editor and run all statements. This creates CTAS snapshot tables in catalog_40_copper_uc_metadata.metadata that mirror system.information_schema views. The executing user becomes the schema owner (definer) — their privileges determine what metadata is visible at refresh time.

Step 2 — Grant the SP access (replace <SP_ID>):

-- Metadata snapshot access
GRANT USE CATALOG ON CATALOG catalog_40_copper_uc_metadata TO `<SP_ID>`;
GRANT USE SCHEMA ON SCHEMA catalog_40_copper_uc_metadata.metadata TO `<SP_ID>`;
GRANT SELECT ON SCHEMA catalog_40_copper_uc_metadata.metadata TO `<SP_ID>`;

-- Cache schema (created automatically, but SP needs permission to create it)
GRANT CREATE SCHEMA ON CATALOG catalog_40_copper_uc_metadata TO `<SP_ID>`;

Getting started

1. Clone and authenticate

git clone https://github.com/richardli29/uc-data-quality-duplicates.git && cd uc-data-quality-duplicates

databricks auth login --host https://<WORKSPACE>.cloud.databricks.com

2. Find your warehouse ID

databricks warehouses list --output json | python3 -c "
import json, sys
for w in json.load(sys.stdin):
    print(f'{w[\"id\"]}  {w[\"name\"]}  ({w[\"state\"]})')
"

3. Configure

Edit two files with your workspace details:

app.yaml — set the SQL warehouse for runtime queries:

env:
  - name: WAREHOUSE_ID
    value: "abc123def456"

databricks.yml — set the workspace and warehouse for deployment:

targets:
  dev:
    workspace:
      host: https://<WORKSPACE>.cloud.databricks.com
    variables:
      warehouse_id: abc123def456

Note: CATALOG_NAME is not required. The app discovers and scans all accessible catalogs automatically.

4. (Optional) Generate test data

Creates 20 education-themed tables across 5 schemas with deliberate duplicates.

Option A — SQL notebook (paste into a Databricks SQL editor):

# Open scripts/generate_data.sql and set the widget to your catalog name

Option B — Python CLI:

python3 scripts/generate_data.py \
  --catalog my_catalog \
  --warehouse abc123def456

Schema	Tables	Purpose
`bronze`	`raw_students`, `raw_schools`, `raw_exam_results`, `raw_attendance`	Raw ingestion layer with original column names
`silver`	`students`, `schools`, `exam_results`, `attendance`	Cleaned and standardised
`gold`	`dim_students`, `dim_schools`, `fact_exam_results`, `fact_attendance_agg`	Curated with SCD tracking, aggregations, and documentation
`team_analytics`	`student_data`, `school_info`, `exam_scores`, `student_attendance`	Duplicate set with renamed columns
`team_reporting`	`pupils`, `school_directory`, `assessment_results`, `attendance_register`	Another duplicate set with different naming

5. Deploy

./scripts/deploy.sh

Or step by step:

databricks bundle validate
databricks bundle deploy
databricks apps start uc-data-duplicates
databricks apps deploy uc-data-duplicates \
  --source-code-path /Workspace/Users/<you>/.bundle/uc-data-duplicates/dev/files

6. Grant permissions to the app service principal

Find the SP ID:

databricks apps get uc-data-duplicates --output json | python3 -c "
import json, sys
d = json.load(sys.stdin)
print(d['service_principal_client_id'])
"

Then grant access (replace <SP_ID>):

-- Metadata snapshots
GRANT USE CATALOG ON CATALOG catalog_40_copper_uc_metadata TO `<SP_ID>`;
GRANT USE SCHEMA ON SCHEMA catalog_40_copper_uc_metadata.metadata TO `<SP_ID>`;
GRANT SELECT ON SCHEMA catalog_40_copper_uc_metadata.metadata TO `<SP_ID>`;

-- Cache schema creation
GRANT CREATE SCHEMA ON CATALOG catalog_40_copper_uc_metadata TO `<SP_ID>`;

And grant CAN_USE on the SQL warehouse:

databricks api patch /api/2.0/permissions/sql/warehouses/<WAREHOUSE_ID> \
  --json '{"access_control_list":[{"service_principal_name":"<SP_ID>","permission_level":"CAN_USE"}]}'

7. Open the app

databricks apps get uc-data-duplicates

Open the url from the output in your browser. You must be logged into the workspace first.

Using the app

Dashboard — on first load, the app checks for a valid cache. If found, results load instantly with a green "Loaded from cache" banner. Otherwise, click Scan All Catalogs to start a background scan with live progress. Stat cards show total schemas, tables, columns, duplicate groups, and per-catalog breakdowns.
Catalog Explorer — browse all catalogs in a tree (catalog > schema > table). Click a table to see columns, owner, comments, and merged permissions (loaded on demand).
Duplicates — view duplicate clusters labelled by entity name (e.g. "Students", "Exam Results") with similarity scores and gold-standard badges. Adjust the threshold slider and re-detect. Filter by catalog prefix, hide governance views and medallion pipeline stages. Cross-catalog duplicates are detected too.
Compare — pick any two tables (from any catalog) for a side-by-side column diff and permissions comparison.

API reference

Method	Path	Description
`GET`	`/api/catalog/list`	List all accessible catalogs
`POST`	`/api/catalog/scan-all`	Start a background scan (returns immediately)
`GET`	`/api/catalog/scan-status`	Poll current scan progress
`GET`	`/api/catalog/cache-status`	Check whether a valid cache exists
`POST`	`/api/catalog/cache-load`	Load scan results from cache + bulk metadata
`GET`	`/api/catalog/schemas?catalog=X`	List scanned schemas (optional catalog filter)
`GET`	`/api/catalog/tables?schema=gold&catalog=X`	List tables (lightweight summaries, no columns)
`GET`	`/api/catalog/table/{catalog}/{schema}/{table}`	Full metadata for one table (columns + permissions loaded on demand)
`POST`	`/api/duplicates/detect?threshold=0.5`	Start background re-detection with a custom threshold
`GET`	`/api/duplicates/detect-status`	Poll detection progress
`GET`	`/api/duplicates/groups`	Return pre-computed duplicate groups
`GET`	`/api/compare/{cat1}/{s1}/{t1}/{cat2}/{s2}/{t2}`	Column + permissions diff between two tables

Customisation

Similarity weights

Edit server/duplicates.py — the detect_duplicates function accepts:

col_weight (default 0.40) — Jaccard index on canonical column names
type_weight (default 0.25) — proportion of shared columns with compatible types
name_weight (default 0.15) — token-based Jaccard on table names
lin_weight (default 0.20) — lineage similarity (1.0 for direct A→B flow, otherwise Jaccard on upstream source tables)

Per-pair normalisation: If neither table in a pair has any lineage data, the lineage weight is redistributed proportionally to the other three components — so those pairs are scored identically to the pre-lineage algorithm. Only pairs where at least one table has lineage entries are scored with the 4-component weights.

Gold standard scoring

Each table in a duplicate group is scored on (max 40 pts):

Column completeness (10 pts) — tables with more columns score higher
Freshness (10 pts) — recently updated tables score higher
Consumer count (10 pts) — tables read by more downstream entities (notebooks, jobs, dashboards, queries) score higher. Based on distinct entity_id counts from table_lineage. This is the "social proof" signal — the table the organisation actually uses most.
Upstream position (10 pts) — if this table is an upstream source of other tables in the same group, it scores the full 10. Copies and derived tables score 0.

The consumer count and upstream position factors are only applied when lineage data is available. Without lineage, scoring reverts to the original 20-point scale (completeness + freshness).

Synonym mappings

Edit the _SYNONYMS dictionary in server/duplicates.py to add domain-specific column name mappings (e.g. student_id = learner_id = pupil_id). This improves duplicate detection when teams use different naming conventions.

Cache configuration

Edit server/cache.py to change:

CACHE_MAX_AGE_DAYS (default 7) — how long before the cache expires
CACHE_VERSION (default "3") — bump this when changing cache table schemas to force a rebuild

TODOs

Productionise group membership ingestion — The scripts/ingest_group_membership notebook fetches user and group membership data from the workspace SCIM API into catalog_40_copper_uc_metadata.metadata.group_membership. It currently runs under interactive user credentials requiring Workspace Admin privileges.

Target setup:

Create a dedicated service principal solely for this purpose (not the app's SP)
Grant it Workspace Admin role — this is the minimum role that can list all users and all group members via the SCIM API. There is no read-only SCIM permission available in Databricks; Workspace Admin is the floor.
Schedule as a weekly job alongside the metadata CTAS refresh
The app reads from the group_membership table at scan time — no admin privilege needed at app runtime

Permission model context (Databricks SCIM API):

Role	List users	List group members	Can modify
Account admin	Yes	Yes	Everything
Workspace admin	Yes	Yes	Add/remove users from workspace, manage workspace-local groups
Group manager (Preview)	No	Only managed groups	Membership of managed groups only
Regular user	No	No	No

Mitigations:

Dedicated SP limits blast radius (no other use)
Workspace admin cannot modify account-level groups, delete users from the account, access other workspaces, or change Unity Catalog permissions
Running as a scheduled job (not app runtime) means the admin credential is never exposed to the app or its users

Compare page lineage section — When viewing two tables side-by-side, add a "Lineage" panel showing:
- Whether one feeds the other (and via what entity type — notebook, job, pipeline)
- Their shared upstream sources
- Consumer count for each (who reads from them)
- Column-level mappings from column_lineage showing which source columns map to which target columns
Integrate column_lineage into duplicate detection — Currently only table_lineage is used. column_lineage (source→target column mappings) could:
- Confirm that two tables' matching columns actually originate from the same source column (stronger signal than name-matching alone)
- Improve the governance_view tagger by verifying the view's columns trace back to the paired table
- Provide column-level provenance on the Compare page (answering "are these the same data with renamed columns?")
Add shared_source filter checkbox — The shared_source tag is assigned but has no dedicated hide checkbox in the frontend filter controls. Consider adding one alongside the governance view and pipeline stage filters.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
frontend/dist		frontend/dist
scripts		scripts
server		server
.gitignore		.gitignore
README.md		README.md
app.py		app.py
app.yaml		app.yaml
databricks.yml		databricks.yml
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

UC Data Quality Duplicates

Features

Architecture

Project structure

How it works

Metadata source

Scan flow

Cache layer

Duplicate group filtering

Prerequisites

Permissions

Required grants for the service principal

Quick setup

Getting started

1. Clone and authenticate

2. Find your warehouse ID

3. Configure

4. (Optional) Generate test data

5. Deploy

6. Grant permissions to the app service principal

7. Open the app

Using the app

API reference

Customisation

Similarity weights

Gold standard scoring

Synonym mappings

Cache configuration

TODOs

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages