Replace data-lake TableManager/FileManager with native SessionContext#258
Open
robinskil wants to merge 2 commits into
Open
Replace data-lake TableManager/FileManager with native SessionContext#258robinskil wants to merge 2 commits into
robinskil wants to merge 2 commits into
Conversation
The beacon-data-lake crate wrapped DataFusion's SessionContext in two manager structs that had grown redundant with it. TableManager was registered as the beacon.public schema provider but kept its own table registry duplicating DataFusion's MemorySchemaProvider; FileManager was mostly thin wrappers over constants and free functions; and the DataLake struct that bundled them was built, drained, and dropped (its own SchemaProvider impl never used). - TableManager -> thin PersistentSchemaProvider that delegates the in-memory catalog to a native MemorySchemaProvider and keeps only the side effect DataFusion lacks: persisting/removing tables://<name>/table.json on register/deregister. Startup recovery is now a free init_tables fn reusing the existing loading/ordering modules; list_table_config reconstructs the spec from the live provider via the extracted definition_from_provider helper. - FileManager -> free functions (create_listing_url, create_temp_output_file, list_datasets, list_dataset_schema); the file-format vec moves into Runtime. - DataLake struct -> register_object_stores free function. - Callers go through session_ctx.table_provider, the catalog, and the free functions; the DDL handlers are unchanged and still auto-persist through the schema provider. PersistentSchemaProvider::register_table deregisters-then-registers so it keeps the old overwrite-on-existing semantics (MemorySchemaProvider errors instead), which the materialized-view refresh/replace/alter paths rely on. Net ~390 fewer lines. beacon-data-lake, beacon-core and beacon-api tests pass.
Contributor
There was a problem hiding this comment.
Pull request overview
This PR simplifies beacon-data-lake by removing the custom TableManager/FileManager indirection and routing table/dataset access through DataFusion’s native SessionContext catalog APIs, while retaining persistence of table definitions via a thin schema-provider wrapper.
Changes:
- Replaced
TableManagerwithPersistentSchemaProvider(wrappingMemorySchemaProvider) plusinit_tables(...)startup recovery. - Replaced
FileManagerwith free functions inbeacon-data-lake::filesand movedfile_formatsownership intobeacon-core::Runtime. - Updated query compilation/output/from paths and runtime wiring to use
SessionContextcatalog/provider lookups and the new free functions.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| beacon-data-lake/src/table_runtime/table_manager.rs | Deleted legacy manager-based schema provider and registry. |
| beacon-data-lake/src/table_runtime/schema_persistence.rs | Extracted definition_from_provider(...) helper for persistence/config reconstruction. |
| beacon-data-lake/src/table_runtime/provider_factory.rs | Deleted factory; provider construction now happens directly from TableDefinition. |
| beacon-data-lake/src/table_runtime/persistent_schema_provider.rs | Added schema provider wrapper that persists/removes table definitions on (de)register. |
| beacon-data-lake/src/table_runtime/ordering.rs | Added test coverage for dependency ordering via table-scan dependencies. |
| beacon-data-lake/src/table_runtime/mod.rs | Added init_tables(...) startup rebuild using the live session + new schema provider. |
| beacon-data-lake/src/lib.rs | Removed DataLake struct; added register_object_stores(...) and re-exports for new APIs. |
| beacon-data-lake/src/files/mod.rs | Replaced manager module with free functions for listing URLs, temp outputs, dataset discovery/schema inference. |
| beacon-data-lake/src/files/manager.rs | Deleted legacy FileManager. |
| beacon-core/src/runtime.rs | Runtime now registers object stores directly, owns file_formats, registers the persistent schema provider, and uses free functions. |
| beacon-core/src/query/output.rs | Output staging now uses beacon_data_lake::create_temp_output_file directly. |
| beacon-core/src/query/from.rs | Table resolution now uses SessionContext catalog/provider APIs; listing URL resolution uses beacon_data_lake::create_listing_url. |
| beacon-core/src/query/compiler.rs | JSON query compiler no longer depends on manager instances. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+46
to
+49
| Ok(provider) => { | ||
| let _ = schema.insert_loaded(table_name.clone(), provider); | ||
| tracing::info!("Registered table '{}'", table_name); | ||
| } |
Comment on lines
+74
to
+77
| // Keep current pagination semantics to avoid behavior regressions. | ||
| let start = offset.unwrap_or(0); | ||
| let end = limit.map(|l| start + l).unwrap_or(datasets.len()); | ||
| let datasets = datasets.into_iter().skip(start).take(end - start).collect(); |
Comment on lines
+15
to
+18
| use datafusion::{ | ||
| catalog::TableProvider, datasource::listing::ListingTableUrl, error::DataFusionError, | ||
| prelude::SessionContext, | ||
| }; |
Pin the behaviors the data-lake refactor introduced: - PersistentSchemaProvider unit tests (in-memory tables store): registration persists the definition and registers it in the catalog; re-registering an existing name overwrites instead of erroring (the behavior MemorySchemaProvider lacks, relied on by MV refresh / Iceberg replace+alter); deregister clears both the catalog and the persisted definition; ensure_default_table is idempotent. - Runtime restart-recovery test: a base table (with data) and a dependent view created by one runtime are rebuilt by a fresh runtime via init_tables, proving the persist-on-register + dependency-ordered startup recovery round trip that replaces the old TableManager registry.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Context
beacon-data-lakewrapped DataFusion'sSessionContextin two manager structs that had grown redundant with it:TableManagerwas registered as thebeacon.publicschema provider but maintained its ownHashMaptable registry — duplicating exactly what DataFusion's nativeMemorySchemaProvideralready does. Its only non-redundant jobs were persisting table definitions and rebuilding them at startup.FileManagerwas mostly thin wrappers over constants and free functions, plus dataset discovery.DataLakestruct that bundled them was built inRuntime::new, had its managers extracted, then dropped — its ownSchemaProviderimpl was never registered.This collapses that indirection so table/file access goes through native DataFusion APIs.
Changes
TableManager→ thinPersistentSchemaProvider— delegates the in-memory catalog (lookup/register/deregister/names) to a nativeMemorySchemaProvider, keeping only the one thing DataFusion doesn't do: persisting/removingtables://<name>/table.jsonon register/deregister.init_tables(ctx, schema)reusing the existingloading/orderingmodules. Deletedtable_manager.rsandprovider_factory.rs.list_table_configreconstructs theTableDefinitionfrom the live provider via the extracteddefinition_from_providerhelper (no parallel definition registry).FileManager→ free functions (create_listing_url,create_temp_output_file,list_datasets,list_dataset_schema); thefile_formatsvec moves intoRuntime. Deletedfiles/manager.rs.DataLakestruct →register_object_stores(ctx)free function.runtime.rs, querycompiler/from/output) drop the manager params and usesession_ctx.table_provider(...), the catalog, and the free functions. The DDL handlers inactions.rs/materialized_view.rsare unchanged and still auto-persist through the schema provider.Behavioral note
DataFusion's
MemorySchemaProvider::register_tableerrors on an existing name, whereas the oldTableManagersilently overwrote.PersistentSchemaProvider::register_tablenow deregisters-then-registers to preserve the overwrite-on-existing semantics that the materialized-view refresh / Iceberg replace / alter paths rely on. (Caught by the materialized-view tests.)Net ~390 fewer lines.
Testing
cargo build(full workspace) ✅cargo test -p beacon-data-lake— 6 passed ✅cargo test -p beacon-core— 18 unit tests + iceberg integration test (covers CREATE/INSERT/DROP/REFRESH/VIEW + JSON/output paths end-to-end) ✅cargo test -p beacon-api— 4 passed ✅