Backlog: schema-aware incremental staging rebuilds #3

@anand-testcompare

Summary

The current incremental staging path is intentionally conservative:

  • it updates only tables touched by new raw Parquet batches
  • if the schema snapshot hash changes, it falls back to a full rebuild

This is the right tradeoff at the current scale. Longer term, a schema change should trigger the smallest safe rebuild rather than a full staging rebuild.

Goal

Make incremental staging schema-aware so we can rebuild only the affected tables when source schemas change.

Desired Behavior

  • no new raw files and no schema changes: do nothing
  • new raw files for a subset of tables: rebuild only those tables
  • schema change for one table: rebuild only that table
  • unknown or unsafe schema drift: fall back to full rebuild
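
A minimal sketch of how the four cases above could map onto a rebuild plan. The names `RebuildMode` and `plan_rebuild`, and the shape of the inputs (two sets of table names plus an `unsafe_drift` flag), are assumptions for illustration and not part of the existing staging code.

```python
from enum import Enum


class RebuildMode(Enum):
    NOOP = "noop"
    PARTIAL = "partial"
    FULL = "full"


def plan_rebuild(touched_tables, schema_changed_tables, unsafe_drift):
    """Map the desired-behavior cases onto a (mode, tables) plan.

    All three arguments are hypothetical inputs: two sets of table names,
    plus a flag saying whether any detected drift could not be classified
    as safe.
    """
    if unsafe_drift:
        # Unknown or unsafe drift: correctness fallback, rebuild everything.
        return RebuildMode.FULL, set()
    affected = set(touched_tables) | set(schema_changed_tables)
    if not affected:
        # No new raw files and no schema changes.
        return RebuildMode.NOOP, set()
    # New raw files and/or a safe schema change for a subset of tables.
    return RebuildMode.PARTIAL, affected
```

Checking `unsafe_drift` first keeps the full rebuild as the default whenever the drift cannot be classified, regardless of which tables were touched.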

Scope

Potential implementation:

  • persist prior schema snapshot (or table-level schema hashes) in staging state
  • diff previous vs current schema snapshot by table
  • take the union of:
    • tables touched by new raw batches
    • tables whose schema changed
  • rematerialize only that set
  • keep full rebuild as the correctness fallback
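
The steps above could be sketched roughly as follows, assuming each table's schema is available as a mapping from column name to a type string and that the previous table-level hashes were persisted in staging state. The function names and the state layout are hypothetical.

```python
import hashlib
import json


def table_schema_hash(columns):
    """Hash one table's schema; `columns` maps column name -> type string."""
    # Sort the items so the hash does not depend on column order.
    canonical = json.dumps(sorted(columns.items()), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_tables(previous_hashes, current_schemas):
    """Return (changed table names, current hashes).

    Tables that are new or that disappeared are both reported as changed;
    the caller decides whether those cases are safe.
    """
    current_hashes = {
        table: table_schema_hash(columns)
        for table, columns in current_schemas.items()
    }
    names = set(previous_hashes) | set(current_hashes)
    changed = {
        table
        for table in names
        if previous_hashes.get(table) != current_hashes.get(table)
    }
    return changed, current_hashes


def tables_to_rebuild(touched_by_new_batches, previous_hashes, current_schemas):
    """Union of tables touched by new raw batches and tables whose schema changed."""
    changed, current_hashes = changed_tables(previous_hashes, current_schemas)
    return set(touched_by_new_batches) | changed, current_hashes
```

The returned `current_hashes` would be written back to staging state only after the rebuild succeeds, so a failed run is retried against the old snapshot.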

Why Deferred

This is more complex than the current need justifies.

The edge cases and test matrix expand quickly:

  • additive columns
  • removed columns
  • renamed columns
  • int-to-float type widening
  • mixed historical/raw values vs new schema
  • nullability drift
  • nested/object fields staying JSON
  • partial state corruption and rebuild fallback
  • table deletion / disappearance
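
To illustrate why the matrix grows, here is a rough sketch of a per-table drift classifier that covers only a few of the cases listed. It assumes schemas are mappings from column name to a type string; the `classify_drift` name, the `SAFE_WIDENINGS` set, and the choice of what counts as safe are all assumptions, and nullability, nested fields, and state corruption are not handled.

```python
# Assumption: only this widening is treated as safe. Whether it really is
# safe depends on the value ranges in historical data.
SAFE_WIDENINGS = {("int64", "float64")}


def classify_drift(old_columns, new_columns):
    """Return "none", "safe", or "unsafe" for one table's schema change."""
    if old_columns == new_columns:
        return "none"
    removed = set(old_columns) - set(new_columns)
    if removed:
        # A removal may also be one half of a rename; the two cannot be
        # told apart from the schemas alone, so treat both as unsafe.
        return "unsafe"
    for name, old_type in old_columns.items():
        new_type = new_columns[name]
        if new_type != old_type and (old_type, new_type) not in SAFE_WIDENINGS:
            return "unsafe"
    # Only additive columns and allowed widenings remain.
    return "safe"
```

Anything the classifier does not recognize falls through to "unsafe", which matches the intent of keeping the full rebuild as the fallback.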

For now, the simpler behavior is acceptable:

  • incremental updates for new raw batches
  • full rebuild fallback on any schema snapshot change

Acceptance Criteria (Later)

  • schema diffing is table-aware
  • incremental mode rebuilds only affected tables when safe
  • unsafe schema changes fall back to full rebuild
  • tests cover representative schema drift scenarios
