Skip to content

Replace data-lake TableManager/FileManager with native SessionContext#258

Open
robinskil wants to merge 2 commits into
mainfrom
claude/crazy-chebyshev-dee4fa
Open

Replace data-lake TableManager/FileManager with native SessionContext#258
robinskil wants to merge 2 commits into
mainfrom
claude/crazy-chebyshev-dee4fa

Conversation

@robinskil

Copy link
Copy Markdown
Collaborator

Context

beacon-data-lake wrapped DataFusion's SessionContext in two manager structs that had grown redundant with it:

  • TableManager was registered as the beacon.public schema provider but maintained its own HashMap table registry — duplicating exactly what DataFusion's native MemorySchemaProvider already does. Its only non-redundant jobs were persisting table definitions and rebuilding them at startup.
  • FileManager was mostly thin wrappers over constants and free functions, plus dataset discovery.
  • The DataLake struct that bundled them was built in Runtime::new, had its managers extracted, then dropped — its own SchemaProvider impl was never registered.

This collapses that indirection so table/file access goes through native DataFusion APIs.

Changes

  • TableManager → thin PersistentSchemaProvider — delegates the in-memory catalog (lookup/register/deregister/names) to a native MemorySchemaProvider, keeping only the one thing DataFusion doesn't do: persisting/removing tables://<name>/table.json on register/deregister.
  • Startup recovery → free init_tables(ctx, schema) reusing the existing loading/ordering modules. Deleted table_manager.rs and provider_factory.rs.
  • list_table_config reconstructs the TableDefinition from the live provider via the extracted definition_from_provider helper (no parallel definition registry).
  • FileManager → free functions (create_listing_url, create_temp_output_file, list_datasets, list_dataset_schema); the file_formats vec moves into Runtime. Deleted files/manager.rs.
  • DataLake struct → register_object_stores(ctx) free function.
  • Callers (runtime.rs, query compiler/from/output) drop the manager params and use session_ctx.table_provider(...), the catalog, and the free functions. The DDL handlers in actions.rs/materialized_view.rs are unchanged and still auto-persist through the schema provider.

Behavioral note

DataFusion's MemorySchemaProvider::register_table errors on an existing name, whereas the old TableManager silently overwrote. PersistentSchemaProvider::register_table now deregisters-then-registers to preserve the overwrite-on-existing semantics that the materialized-view refresh / Iceberg replace / alter paths rely on. (Caught by the materialized-view tests.)

Net ~390 fewer lines.

Testing

  • cargo build (full workspace) ✅
  • cargo test -p beacon-data-lake — 6 passed ✅
  • cargo test -p beacon-core — 18 unit tests + iceberg integration test (covers CREATE/INSERT/DROP/REFRESH/VIEW + JSON/output paths end-to-end) ✅
  • cargo test -p beacon-api — 4 passed ✅

The beacon-data-lake crate wrapped DataFusion's SessionContext in two manager
structs that had grown redundant with it. TableManager was registered as the
beacon.public schema provider but kept its own table registry duplicating
DataFusion's MemorySchemaProvider; FileManager was mostly thin wrappers over
constants and free functions; and the DataLake struct that bundled them was
built, drained, and dropped (its own SchemaProvider impl never used).

- TableManager -> thin PersistentSchemaProvider that delegates the in-memory
  catalog to a native MemorySchemaProvider and keeps only the side effect
  DataFusion lacks: persisting/removing tables://<name>/table.json on
  register/deregister. Startup recovery is now a free init_tables fn reusing the
  existing loading/ordering modules; list_table_config reconstructs the spec from
  the live provider via the extracted definition_from_provider helper.
- FileManager -> free functions (create_listing_url, create_temp_output_file,
  list_datasets, list_dataset_schema); the file-format vec moves into Runtime.
- DataLake struct -> register_object_stores free function.
- Callers go through session_ctx.table_provider, the catalog, and the free
  functions; the DDL handlers are unchanged and still auto-persist through the
  schema provider.

PersistentSchemaProvider::register_table deregisters-then-registers so it keeps
the old overwrite-on-existing semantics (MemorySchemaProvider errors instead),
which the materialized-view refresh/replace/alter paths rely on.

Net ~390 fewer lines. beacon-data-lake, beacon-core and beacon-api tests pass.
Copilot AI review requested due to automatic review settings June 15, 2026 10:32

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR simplifies beacon-data-lake by removing the custom TableManager/FileManager indirection and routing table/dataset access through DataFusion’s native SessionContext catalog APIs, while retaining persistence of table definitions via a thin schema-provider wrapper.

Changes:

  • Replaced TableManager with PersistentSchemaProvider (wrapping MemorySchemaProvider) plus init_tables(...) startup recovery.
  • Replaced FileManager with free functions in beacon-data-lake::files and moved file_formats ownership into beacon-core::Runtime.
  • Updated query compilation/output/from paths and runtime wiring to use SessionContext catalog/provider lookups and the new free functions.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
beacon-data-lake/src/table_runtime/table_manager.rs Deleted legacy manager-based schema provider and registry.
beacon-data-lake/src/table_runtime/schema_persistence.rs Extracted definition_from_provider(...) helper for persistence/config reconstruction.
beacon-data-lake/src/table_runtime/provider_factory.rs Deleted factory; provider construction now happens directly from TableDefinition.
beacon-data-lake/src/table_runtime/persistent_schema_provider.rs Added schema provider wrapper that persists/removes table definitions on (de)register.
beacon-data-lake/src/table_runtime/ordering.rs Added test coverage for dependency ordering via table-scan dependencies.
beacon-data-lake/src/table_runtime/mod.rs Added init_tables(...) startup rebuild using the live session + new schema provider.
beacon-data-lake/src/lib.rs Removed DataLake struct; added register_object_stores(...) and re-exports for new APIs.
beacon-data-lake/src/files/mod.rs Replaced manager module with free functions for listing URLs, temp outputs, dataset discovery/schema inference.
beacon-data-lake/src/files/manager.rs Deleted legacy FileManager.
beacon-core/src/runtime.rs Runtime now registers object stores directly, owns file_formats, registers the persistent schema provider, and uses free functions.
beacon-core/src/query/output.rs Output staging now uses beacon_data_lake::create_temp_output_file directly.
beacon-core/src/query/from.rs Table resolution now uses SessionContext catalog/provider APIs; listing URL resolution uses beacon_data_lake::create_listing_url.
beacon-core/src/query/compiler.rs JSON query compiler no longer depends on manager instances.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +46 to +49
Ok(provider) => {
let _ = schema.insert_loaded(table_name.clone(), provider);
tracing::info!("Registered table '{}'", table_name);
}
Comment on lines +74 to +77
// Keep current pagination semantics to avoid behavior regressions.
let start = offset.unwrap_or(0);
let end = limit.map(|l| start + l).unwrap_or(datasets.len());
let datasets = datasets.into_iter().skip(start).take(end - start).collect();
Comment on lines +15 to +18
use datafusion::{
catalog::TableProvider, datasource::listing::ListingTableUrl, error::DataFusionError,
prelude::SessionContext,
};
Pin the behaviors the data-lake refactor introduced:

- PersistentSchemaProvider unit tests (in-memory tables store): registration
  persists the definition and registers it in the catalog; re-registering an
  existing name overwrites instead of erroring (the behavior MemorySchemaProvider
  lacks, relied on by MV refresh / Iceberg replace+alter); deregister clears both
  the catalog and the persisted definition; ensure_default_table is idempotent.
- Runtime restart-recovery test: a base table (with data) and a dependent view
  created by one runtime are rebuilt by a fresh runtime via init_tables, proving
  the persist-on-register + dependency-ordered startup recovery round trip that
  replaces the old TableManager registry.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants