Skip to content

nick-treece/uc-duplicate-data-detector

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

18 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

UC Data Quality Duplicates

A Databricks App that scans Unity Catalog metadata across all accessible catalogs to find duplicate datasets, recommends gold-standard tables, and surfaces group-level access permissions — helping data architects and engineers clean up data sprawl before it reaches analysts.

Features

Feature Description
Multi-Catalog Scanner Background scan of every accessible catalog with live progress polling — schemas, tables, columns, types, comments, timestamps, and permissions. Per-catalog breakdowns shown on the dashboard.
Result Caching Scan results and duplicate groups are cached in catalog_40_copper_uc_metadata.cache. On startup the app loads from cache instantly; a fresh scan is triggered automatically if the cache is older than 7 days or the cache schema version has changed.
Permissions Viewer Shows which groups and users have READ / WRITE access to each table via metadata snapshot tables — no MANAGE privilege needed. Permissions are merged across catalog, schema, and table grant levels.
Duplicate Detection Clusters tables that represent the same entity using four similarity signals: column-name Jaccard, type compatibility, fuzzy table-name matching, and Unity Catalog table lineage (shared upstream sources / direct data flow). Uses a normalised-name grouping pre-filter for scalable comparison (~100K candidates from 143K tables). Groups are labelled by common entity name.
Lineage Integration Loads the last 90 days of system.access.table_lineage (via snapshot) to identify tables that share upstream sources or directly feed each other. Enhances duplicate scoring, pipeline-stage tagging, and gold-standard ranking.
Gold Standard Scoring Ranks each table in a duplicate group on column completeness, freshness, downstream consumer count, and upstream position to recommend the canonical dataset.
Table Comparison Side-by-side column diff and permissions diff for any two tables.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                      Databricks App                          │
│                                                              │
│  ┌──────────────┐       ┌──────────────────────────────────┐ │
│  │   Frontend    │       │       FastAPI Backend             │ │
│  │  (Vanilla JS) │◄────►│                                  │ │
│  │              │       │  /api/catalog/*   (scan, cache)   │ │
│  │  Dashboard   │       │  /api/duplicates/* (groups)       │ │
│  │  Catalog     │       │  /api/compare/*   (diff, sample)  │ │
│  │  Duplicates  │       │                                  │ │
│  │  Compare     │       │  scanner.py   (SQL Statement API) │ │
│  └──────────────┘       │  cache.py     (UC cache layer)   │ │
│                         │  duplicates.py (detection engine) │ │
│                         │  comparator.py (table diff)      │ │
│                         └────────┬─────────────────────────┘ │
└──────────────────────────────────┼───────────────────────────┘
                                   │
                    ┌──────────────▼──────────────────┐
                    │       Unity Catalog               │
                    │                                   │
                    │  catalog_40_copper_uc_metadata     │
                    │   ├── metadata.*  (weekly CTAS     │
                    │   │   snapshots of                 │
                    │   │   system.information_schema)   │
                    │   └── cache.*    (scan results     │
                    │       + duplicate groups)          │
                    │                                   │
                    │  SQL Statement API                 │
                    │   (EXTERNAL_LINKS disposition)     │
                    └───────────────────────────────────┘

Project structure

uc-duplicate-data-detector/
├── databricks.yml              # DAB bundle config (targets, variables)
├── app.yaml                    # App runtime config (command, env vars)
├── app.py                      # FastAPI entrypoint
├── requirements.txt            # Python dependencies
├── server/
│   ├── config.py               # Dual-mode auth (local CLI / deployed App)
│   ├── scanner.py              # UC metadata scanner + permissions (via SQL)
│   ├── cache.py                # Cache manager (read/write scan results to UC)
│   ├── duplicates.py           # Duplicate detection + gold standard scoring
│   ├── comparator.py           # Table comparison + sample data
│   └── routes/
│       ├── catalog.py          # /api/catalog/*  (list, scan, cache, schemas, tables)
│       ├── duplicates.py       # /api/duplicates/*
│       └── compare.py          # /api/compare/*
├── frontend/
│   └── dist/                   # Static SPA (HTML/CSS/JS, no build step)
└── scripts/
    ├── deploy.sh                      # One-command deploy (bundle + app source)
    ├── create_governance_tables.sql   # CTAS snapshots of system.information_schema (run weekly)
    ├── ingest_group_membership        # Notebook: fetches SCIM group memberships into UC
    ├── generate_data.py               # Test data generator (Python + CLI)
    └── generate_data.sql              # Test data generator (pure SQL)

How it works

Metadata source

All metadata is read from snapshot tables in catalog_40_copper_uc_metadata.metadata — weekly CTAS copies of system.information_schema tables (catalogs, schemata, tables, columns, catalog_privileges, schema_privileges, table_privileges) plus system.access.table_lineage and system.access.column_lineage. These snapshots provide definer-rights access so the app's service principal can read metadata across all catalogs without MANAGE privileges.

Scan flow

  1. POST /api/catalog/scan-all launches a background thread that queries each catalog sequentially (schemas, tables, columns, and three privilege levels).
  2. The frontend polls GET /api/catalog/scan-status every 2 seconds with progress updates ("Scanning catalog_name (3/7)…").
  3. Table lineage is loaded from the metadata.table_lineage snapshot (last 90 days): distinct source→target edges and per-table consumer counts.
  4. Duplicate detection runs in the same background thread using a normalised-name pre-filter and four-component composite similarity scoring (columns, types, names, lineage).
  5. Results are written to the UC cache (catalog_40_copper_uc_metadata.cache) so the next app restart loads instantly.
  6. The SQL Statement API uses EXTERNAL_LINKS disposition to handle result sets exceeding the 25 MB inline limit.

Cache layer

Scan results and duplicate groups are persisted in two Delta tables with append-based versioning (scan_id):

Table Contents
cache.cache_metadata One row per scan: scan_id, cache version, timestamp, serialised scan-result JSON
cache.duplicate_groups One row per group per scan: scan_id, label, tables, pairs, gold standard, scores, tags (complex fields as JSON)

Each new scan appends a new scan_id — previous scans are retained as auditable history. The app always loads from the latest scan_id.

Invalidation triggers:

  • Cache older than 7 days (matching the weekly metadata refresh cadence)
  • CACHE_VERSION constant in cache.py differs from the stored version (covers schema changes during development)

On startup the frontend calls GET /api/catalog/cache-status. If valid, POST /api/catalog/cache-load bulk-loads tables and schemas from the metadata snapshot (2 queries total, vs 6 × N catalogs for a full scan) and restores duplicate groups from cache. Table columns and permissions are loaded lazily on demand when viewing individual table details.

Duplicate group filtering

Groups are automatically tagged during detection:

Tag Detection method Default
governance_view 2-table group where a VIEW's columns are a subset of the paired TABLE Hidden
pipeline_stage Lineage-first: any table in the group directly feeds another member (confirmed via table_lineage). Fallback: all tables in medallion catalogs (gold/silver/bronze) spanning 2+ tiers Hidden
shared_source All tables in the group share at least one common upstream source table (requires lineage data for every member) Visible (informational)

Frontend filters (Duplicates page): two checkboxes for tag-based hiding + a free-text catalog prefix filter. Pagination limits rendering to 50 groups at a time.

Prerequisites

  • Databricks CLI >= 0.230
  • Python 3.10+
  • A Databricks workspace with:
    • Unity Catalog enabled
    • A SQL warehouse (Serverless or Pro)
    • The catalog_40_copper_uc_metadata catalog with a metadata schema (weekly CTAS snapshots)
    • Permission to create Apps

Permissions

The app's service principal needs no MANAGE privilege. Permissions are read from weekly CTAS snapshot tables in catalog_40_copper_uc_metadata.metadata — mirrors of system.information_schema that provide definer-rights access.

Required grants for the service principal

Privilege Scope Purpose
USE CATALOG catalog_40_copper_uc_metadata Access the app catalog
USE SCHEMA catalog_40_copper_uc_metadata.metadata Read metadata snapshots
SELECT catalog_40_copper_uc_metadata.metadata Query schemas, tables, columns, privileges
CREATE SCHEMA catalog_40_copper_uc_metadata Create the cache schema on first scan
CAN_USE SQL warehouse Execute SQL queries

Quick setup

Step 1 — Create metadata snapshot tables (schedule weekly):

Open scripts/create_governance_tables.sql in a Databricks SQL editor and run all statements. This creates CTAS snapshot tables in catalog_40_copper_uc_metadata.metadata that mirror system.information_schema views. The executing user becomes the schema owner (definer) — their privileges determine what metadata is visible at refresh time.

Step 2 — Grant the SP access (replace <SP_ID>):

-- Metadata snapshot access
GRANT USE CATALOG ON CATALOG catalog_40_copper_uc_metadata TO `<SP_ID>`;
GRANT USE SCHEMA ON SCHEMA catalog_40_copper_uc_metadata.metadata TO `<SP_ID>`;
GRANT SELECT ON SCHEMA catalog_40_copper_uc_metadata.metadata TO `<SP_ID>`;

-- Cache schema (created automatically, but SP needs permission to create it)
GRANT CREATE SCHEMA ON CATALOG catalog_40_copper_uc_metadata TO `<SP_ID>`;

Getting started

1. Clone and authenticate

git clone https://github.com/richardli29/uc-data-quality-duplicates.git && cd uc-data-quality-duplicates

databricks auth login --host https://<WORKSPACE>.cloud.databricks.com

2. Find your warehouse ID

databricks warehouses list --output json | python3 -c "
import json, sys
for w in json.load(sys.stdin):
    print(f'{w[\"id\"]}  {w[\"name\"]}  ({w[\"state\"]})')
"

3. Configure

Edit two files with your workspace details:

app.yaml — set the SQL warehouse for runtime queries:

env:
  - name: WAREHOUSE_ID
    value: "abc123def456"

databricks.yml — set the workspace and warehouse for deployment:

targets:
  dev:
    workspace:
      host: https://<WORKSPACE>.cloud.databricks.com
    variables:
      warehouse_id: abc123def456

Note: CATALOG_NAME is not required. The app discovers and scans all accessible catalogs automatically.

4. (Optional) Generate test data

Creates 20 education-themed tables across 5 schemas with deliberate duplicates.

Option A — SQL notebook (paste into a Databricks SQL editor):

# Open scripts/generate_data.sql and set the widget to your catalog name

Option B — Python CLI:

python3 scripts/generate_data.py \
  --catalog my_catalog \
  --warehouse abc123def456
Schema Tables Purpose
bronze raw_students, raw_schools, raw_exam_results, raw_attendance Raw ingestion layer with original column names
silver students, schools, exam_results, attendance Cleaned and standardised
gold dim_students, dim_schools, fact_exam_results, fact_attendance_agg Curated with SCD tracking, aggregations, and documentation
team_analytics student_data, school_info, exam_scores, student_attendance Duplicate set with renamed columns
team_reporting pupils, school_directory, assessment_results, attendance_register Another duplicate set with different naming

5. Deploy

./scripts/deploy.sh

Or step by step:

databricks bundle validate
databricks bundle deploy
databricks apps start uc-data-duplicates
databricks apps deploy uc-data-duplicates \
  --source-code-path /Workspace/Users/<you>/.bundle/uc-data-duplicates/dev/files

6. Grant permissions to the app service principal

Find the SP ID:

databricks apps get uc-data-duplicates --output json | python3 -c "
import json, sys
d = json.load(sys.stdin)
print(d['service_principal_client_id'])
"

Then grant access (replace <SP_ID>):

-- Metadata snapshots
GRANT USE CATALOG ON CATALOG catalog_40_copper_uc_metadata TO `<SP_ID>`;
GRANT USE SCHEMA ON SCHEMA catalog_40_copper_uc_metadata.metadata TO `<SP_ID>`;
GRANT SELECT ON SCHEMA catalog_40_copper_uc_metadata.metadata TO `<SP_ID>`;

-- Cache schema creation
GRANT CREATE SCHEMA ON CATALOG catalog_40_copper_uc_metadata TO `<SP_ID>`;

And grant CAN_USE on the SQL warehouse:

databricks api patch /api/2.0/permissions/sql/warehouses/<WAREHOUSE_ID> \
  --json '{"access_control_list":[{"service_principal_name":"<SP_ID>","permission_level":"CAN_USE"}]}'

7. Open the app

databricks apps get uc-data-duplicates

Open the url from the output in your browser. You must be logged into the workspace first.

Using the app

  1. Dashboard — on first load, the app checks for a valid cache. If found, results load instantly with a green "Loaded from cache" banner. Otherwise, click Scan All Catalogs to start a background scan with live progress. Stat cards show total schemas, tables, columns, duplicate groups, and per-catalog breakdowns.
  2. Catalog Explorer — browse all catalogs in a tree (catalog > schema > table). Click a table to see columns, owner, comments, and merged permissions (loaded on demand).
  3. Duplicates — view duplicate clusters labelled by entity name (e.g. "Students", "Exam Results") with similarity scores and gold-standard badges. Adjust the threshold slider and re-detect. Filter by catalog prefix, hide governance views and medallion pipeline stages. Cross-catalog duplicates are detected too.
  4. Compare — pick any two tables (from any catalog) for a side-by-side column diff and permissions comparison.

API reference

Method Path Description
GET /api/catalog/list List all accessible catalogs
POST /api/catalog/scan-all Start a background scan (returns immediately)
GET /api/catalog/scan-status Poll current scan progress
GET /api/catalog/cache-status Check whether a valid cache exists
POST /api/catalog/cache-load Load scan results from cache + bulk metadata
GET /api/catalog/schemas?catalog=X List scanned schemas (optional catalog filter)
GET /api/catalog/tables?schema=gold&catalog=X List tables (lightweight summaries, no columns)
GET /api/catalog/table/{catalog}/{schema}/{table} Full metadata for one table (columns + permissions loaded on demand)
POST /api/duplicates/detect?threshold=0.5 Start background re-detection with a custom threshold
GET /api/duplicates/detect-status Poll detection progress
GET /api/duplicates/groups Return pre-computed duplicate groups
GET /api/compare/{cat1}/{s1}/{t1}/{cat2}/{s2}/{t2} Column + permissions diff between two tables

Customisation

Similarity weights

Edit server/duplicates.py — the detect_duplicates function accepts:

  • col_weight (default 0.40) — Jaccard index on canonical column names
  • type_weight (default 0.25) — proportion of shared columns with compatible types
  • name_weight (default 0.15) — token-based Jaccard on table names
  • lin_weight (default 0.20) — lineage similarity (1.0 for direct A→B flow, otherwise Jaccard on upstream source tables)

Per-pair normalisation: If neither table in a pair has any lineage data, the lineage weight is redistributed proportionally to the other three components — so those pairs are scored identically to the pre-lineage algorithm. Only pairs where at least one table has lineage entries are scored with the 4-component weights.

Gold standard scoring

Each table in a duplicate group is scored on (max 40 pts):

  • Column completeness (10 pts) — tables with more columns score higher
  • Freshness (10 pts) — recently updated tables score higher
  • Consumer count (10 pts) — tables read by more downstream entities (notebooks, jobs, dashboards, queries) score higher. Based on distinct entity_id counts from table_lineage. This is the "social proof" signal — the table the organisation actually uses most.
  • Upstream position (10 pts) — if this table is an upstream source of other tables in the same group, it scores the full 10. Copies and derived tables score 0.

The consumer count and upstream position factors are only applied when lineage data is available. Without lineage, scoring reverts to the original 20-point scale (completeness + freshness).

Synonym mappings

Edit the _SYNONYMS dictionary in server/duplicates.py to add domain-specific column name mappings (e.g. student_id = learner_id = pupil_id). This improves duplicate detection when teams use different naming conventions.

Cache configuration

Edit server/cache.py to change:

  • CACHE_MAX_AGE_DAYS (default 7) — how long before the cache expires
  • CACHE_VERSION (default "3") — bump this when changing cache table schemas to force a rebuild

TODOs

  • Productionise group membership ingestion — The scripts/ingest_group_membership notebook fetches user and group membership data from the workspace SCIM API into catalog_40_copper_uc_metadata.metadata.group_membership. It currently runs under interactive user credentials requiring Workspace Admin privileges.

    Target setup:

    1. Create a dedicated service principal solely for this purpose (not the app's SP)
    2. Grant it Workspace Admin role — this is the minimum role that can list all users and all group members via the SCIM API. There is no read-only SCIM permission available in Databricks; Workspace Admin is the floor.
    3. Schedule as a weekly job alongside the metadata CTAS refresh
    4. The app reads from the group_membership table at scan time — no admin privilege needed at app runtime

    Permission model context (Databricks SCIM API):

    Role List users List group members Can modify
    Account admin Yes Yes Everything
    Workspace admin Yes Yes Add/remove users from workspace, manage workspace-local groups
    Group manager (Preview) No Only managed groups Membership of managed groups only
    Regular user No No No

    Mitigations:

    • Dedicated SP limits blast radius (no other use)
    • Workspace admin cannot modify account-level groups, delete users from the account, access other workspaces, or change Unity Catalog permissions
    • Running as a scheduled job (not app runtime) means the admin credential is never exposed to the app or its users
  • Compare page lineage section — When viewing two tables side-by-side, add a "Lineage" panel showing:

    • Whether one feeds the other (and via what entity type — notebook, job, pipeline)
    • Their shared upstream sources
    • Consumer count for each (who reads from them)
    • Column-level mappings from column_lineage showing which source columns map to which target columns
  • Integrate column_lineage into duplicate detection — Currently only table_lineage is used. column_lineage (source→target column mappings) could:

    • Confirm that two tables' matching columns actually originate from the same source column (stronger signal than name-matching alone)
    • Improve the governance_view tagger by verifying the view's columns trace back to the paired table
    • Provide column-level provenance on the Compare page (answering "are these the same data with renamed columns?")
  • Add shared_source filter checkbox — The shared_source tag is assigned but has no dedicated hide checkbox in the frontend filter controls. Consider adding one alongside the governance view and pipeline stage filters.

License

MIT

About

A Databricks application designed to scan unity catalog metadata to assess the similarity of datasets to identify duplication of data and enable co-ordinated rationalisation of core datasets.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Python 93.2%
  • Jupyter Notebook 5.4%
  • Shell 1.4%