Federated remote-table support over Flight SQL #284

Merged
robinskil merged 4 commits into main from features/federated-table
Jun 18, 2026

Summary

Adds federated remote tables: an admin registers a table that points at a table on another Beacon instance, and queries push as much work as possible (filters, projection, limit, joins, aggregates) down to the remote so only the reduced result crosses the network.

CREATE EXTERNAL TABLE remote_obs STORED AS REMOTE
  LOCATION 'beacon://other-host:50051/obs'
  OPTIONS ('username' 'admin', 'password' 'secret', 'tls' 'false');

SELECT count(*), avg(val) FROM remote_obs WHERE id > 100;  -- filter+aggregate pushed to the remote

Built on datafusion-federation (pinned =0.5.3, the release targeting datafusion ^53, matching our 53.1.0). Its optimizer rule federates the largest sub-plan rooted at remote tables and runs it on the remote via a SQLExecutor backed by an Arrow Flight SQL client — the same transport Beacon already serves.

Changes

  • New module beacon-datafusion-ext/src/remote/
    • RemoteConnection — Flight SQL client + Basic→Bearer handshake
    • BeaconFlightSqlExecutor — implements federation's SQLExecutor; runs pushed SQL on the remote and streams Arrow batches back (async→sync bridge)
    • RemoteTableDefinition (typetag-serde) + build_provider — pins the schema via a LIMIT 0 fetch and builds the federated provider
  • Runtime wiring — register default_optimizer_rules() and add federation's FederatedPlanner to BeaconQueryPlanner's extension planners
  • DDL routing: STORED AS REMOTE branches to the federated builder in create_external_table
  • Persistence — recover the definition from the registered provider so it round-trips to table.json; reload pins the stored schema, so a down remote doesn't block startup

Design decisions

  • Setup surface: reuse CREATE EXTERNAL TABLE … STORED AS REMOTE (no new endpoint/parser)
  • Credentials: stored inline in table.json (plaintext). Creation is admin-gated DDL; flagged here for visibility
  • Pushdown: full query-plan pushdown via datafusion-federation, not hand-rolled filter pushdown

Testing

  • End-to-end loopback federation test: one runtime serves obs, and a remote_obs table federates back to it. SELECT count(*), sum(val) FROM remote_obs WHERE id > 1 returns the correct result, and EXPLAIN confirms a federated/virtual scan node. Exercises auth, schema fetch, pushdown, and streaming.
  • Unit tests: RemoteTableDefinition serde round-trip, parse_remote_location.
  • No regressions: all Flight SQL tests and beacon-core lib tests pass.

Notes

  • One gotcha handled: DataFusion prefixes OPTIONS keys lacking a . with format., so option lookups check both forms.
  • Reload-on-restart is covered by serde tests + the no-network pinned-schema path rather than a live restart test.
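
The option-key gotcha above can be illustrated with a small lookup that tries the bare key first and then the `format.`-prefixed form. This is only a sketch: `lookup_option` is a hypothetical helper name, and the PR's actual lookup code is not shown on this page.

```rust
use std::collections::HashMap;

/// Looks up a DDL option by its bare name, falling back to the
/// `format.`-prefixed form that DataFusion may have produced.
/// Hypothetical helper for illustration, not the PR's function.
fn lookup_option<'a>(options: &'a HashMap<String, String>, key: &str) -> Option<&'a str> {
    options
        .get(key)
        // Only build the prefixed key when the bare key is absent.
        .or_else(|| options.get(&format!("format.{key}")))
        .map(String::as_str)
}
```

With a map containing `format.tls = true`, `lookup_option(&options, "tls")` would find the value even though the DDL only said `'tls' 'true'`.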

Let an admin register a table that points at another Beacon instance and
push query work (filters, projection, limit, joins, aggregates) down to
the remote so only the reduced result crosses the network:

    CREATE EXTERNAL TABLE remote_obs STORED AS REMOTE
      LOCATION 'beacon://other-host:50051/obs'
      OPTIONS ('username' 'admin', 'password' 'secret');

Built on datafusion-federation (pinned =0.5.3, the release targeting
datafusion ^53): its optimizer rule federates the largest sub-plan rooted
at remote tables and runs it on the remote via a SQLExecutor backed by an
Arrow Flight SQL client.

- beacon-datafusion-ext/src/remote: RemoteConnection (Flight SQL client +
  handshake), BeaconFlightSqlExecutor (SQLExecutor), RemoteTableDefinition
  (typetag-serde) + provider with schema pinned at registration.
- runtime: register default_optimizer_rules() and add FederatedPlanner to
  BeaconQueryPlanner's extension planners.
- actions: route STORED AS REMOTE to the federated builder.
- schema_persistence: recover the definition from the registered provider
  so it round-trips to table.json; reload uses the pinned schema (no remote
  needed at startup).

Credentials are stored inline in table.json (admin-gated DDL).

Tests: end-to-end loopback federation (filter+aggregate pushdown, auth,
streaming, federated plan node), RemoteTableDefinition serde round-trip,
and parse_remote_location. Existing Flight SQL and beacon-core suites green.

Copilot AI review requested due to automatic review settings June 18, 2026 10:44

Add a Remote Tables (Federation) page covering STORED AS REMOTE: the
beacon:// LOCATION format, OPTIONS (username/password/tls), how
filter/projection/limit/join/aggregate pushdown works over Flight SQL,
schema pinning at creation, restart behavior, and limitations. Wire it
into the data-lake sidebar and cross-link from the external-tables page.

Copilot AI left a comment

Pull request overview

Adds federated “remote tables” backed by Arrow Flight SQL, allowing DataFusion plans rooted at remote Beacon tables to be pushed down and executed on a remote Beacon instance via datafusion-federation.

Changes:

  • Introduces a new beacon-datafusion-ext::remote module (connection, executor, table definition/provider adaptor) to run pushed SQL over Flight SQL and stream results back.
  • Wires datafusion-federation into runtime planning (optimizer rules + FederatedPlanner) and adds DDL routing for CREATE EXTERNAL TABLE … STORED AS REMOTE.
  • Adds persistence support (recovering remote definitions from registered providers) and an end-to-end loopback federation test over Flight SQL.

Reviewed changes

Copilot reviewed 16 out of 17 changed files in this pull request and generated 5 comments.

| File | Description |
|---|---|
| Cargo.toml | Adds workspace dependencies for federation + tonic transport. |
| Cargo.lock | Locks new dependency graph entries for federation/tonic/flight-sql usage. |
| beacon-datafusion-ext/src/remote/mod.rs | Remote-table module entrypoint + definition recovery from federated providers. |
| beacon-datafusion-ext/src/remote/executor.rs | Implements federation SQLExecutor that executes pushed SQL remotely over Flight SQL. |
| beacon-datafusion-ext/src/remote/definition.rs | Adds persisted RemoteTableDefinition and builds federated providers with pinned schema. |
| beacon-datafusion-ext/src/remote/connection.rs | Adds Flight SQL client connection + handshake logic for remote Beacon instances. |
| beacon-datafusion-ext/src/lib.rs | Exposes the new remote module publicly. |
| beacon-datafusion-ext/Cargo.toml | Adds crate-level deps for federation + flight-sql client support. |
| beacon-data-lake/src/table_runtime/schema_persistence.rs | Persists remote table definitions by recovering them from registered providers. |
| beacon-core/src/statement_plan/query_planner.rs | Registers FederatedPlanner to lower federation extension nodes. |
| beacon-core/src/statement_plan/actions.rs | Routes STORED AS REMOTE DDL and parses beacon://… remote locations/options. |
| beacon-core/src/runtime.rs | Enables federation optimizer rules in the DataFusion session. |
| beacon-core/Cargo.toml | Adds datafusion-federation dependency to core. |
| beacon-api/src/flight_sql/tests.rs | Adds loopback end-to-end federation test validating pushdown + streaming. |

Comment on lines +22 to +25
/// Maps any displayable error into a DataFusion external error.
fn remote_err<E: std::fmt::Display>(error: E) -> DataFusionError {
    DataFusionError::External(format!("remote beacon: {error}").into())
}
Comment on lines +11 to +18
#[derive(Clone, Debug)]
pub struct RemoteConnection {
    /// gRPC endpoint of the remote Flight SQL server, e.g. `http://host:50051`.
    pub url: String,
    pub username: Option<String>,
    pub password: Option<String>,
}

Comment on lines +34 to +53
    pub async fn connect(&self) -> anyhow::Result<FlightSqlServiceClient<Channel>> {
        let channel = Endpoint::from_shared(self.url.clone())
            .with_context(|| format!("invalid remote beacon endpoint '{}'", self.url))?
            .connect()
            .await
            .with_context(|| format!("failed to connect to remote beacon at '{}'", self.url))?;

        let mut client = FlightSqlServiceClient::new(channel);

        if let Some(username) = &self.username {
            let password = self.password.as_deref().unwrap_or_default();
            client
                .handshake(username, password)
                .await
                .with_context(|| format!("Flight SQL handshake with '{}' failed", self.url))?;
        }

        Ok(client)
    }
}
Comment on lines +23 to +39
#[derive(Clone, Debug, serde::Serialize, serde::Deserialize)]
pub struct RemoteTableDefinition {
    /// Local logical table name.
    pub name: String,
    /// gRPC endpoint of the remote Flight SQL server, e.g. `http://host:50051`.
    pub url: String,
    /// Table name on the remote instance.
    pub remote_table: String,
    #[serde(default)]
    pub username: Option<String>,
    #[serde(default)]
    pub password: Option<String>,
    /// Pinned output schema. An empty schema means "fetch from the remote when
    /// building the provider" (and the resolved schema is then pinned).
    pub schema: SchemaRef,
}
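
The doc comment on `schema` describes a two-way rule: use the pinned schema when present, and contact the remote only when the stored schema is empty. A minimal std-only sketch of that rule follows. The `Schema` alias and `resolve_schema` are hypothetical stand-ins (the real code uses Arrow's `SchemaRef`, and its fetch path is not shown here).

```rust
/// Stand-in for an Arrow schema: (column name, type name) pairs.
/// Assumption: chosen only to keep this sketch free of external crates.
type Schema = Vec<(String, String)>;

/// Returns the pinned schema when one is stored, and calls `fetch_remote`
/// (for example a `LIMIT 0` query) only when the stored schema is empty.
fn resolve_schema<F>(stored: &Schema, fetch_remote: F) -> Result<Schema, String>
where
    F: FnOnce() -> Result<Schema, String>,
{
    if stored.is_empty() {
        // First registration: the remote has to be reachable.
        fetch_remote()
    } else {
        // Reload path: no network call, so a down remote cannot block startup.
        Ok(stored.clone())
    }
}
```

Because the fetch is passed as a closure, the reload path can be exercised with a closure that always fails, which mirrors the "no-network pinned-schema path" mentioned in the PR notes.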

Comment on lines +161 to +164
let tls = tls_option
    .map(|v| v.eq_ignore_ascii_case("true"))
    .unwrap_or(false);
let scheme = if tls { "https" } else { "http" };
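
To show how the scheme selection above might combine with the `beacon://host:port/table` LOCATION format, here is a hedged sketch. `RemoteLocation` and `parse_location_sketch` are hypothetical names. The PR's own `parse_remote_location` is not reproduced on this page and may validate differently.

```rust
/// Parsed pieces of a `beacon://host:port/table` location (hypothetical type).
#[derive(Debug, PartialEq)]
struct RemoteLocation {
    /// gRPC endpoint, e.g. `http://host:50051`.
    url: String,
    /// Table name on the remote instance.
    remote_table: String,
}

fn parse_location_sketch(
    location: &str,
    tls_option: Option<&str>,
) -> Result<RemoteLocation, String> {
    let rest = location
        .strip_prefix("beacon://")
        .ok_or_else(|| format!("expected a beacon:// location, got '{location}'"))?;
    // Everything before the first '/' is host:port, the remainder is the table.
    let (authority, table) = rest
        .split_once('/')
        .ok_or_else(|| format!("missing table name in '{location}'"))?;
    if authority.is_empty() || table.is_empty() {
        return Err(format!("incomplete location '{location}'"));
    }
    // Same rule as the excerpt: only a case-insensitive "true" enables TLS.
    let tls = tls_option
        .map(|v| v.eq_ignore_ascii_case("true"))
        .unwrap_or(false);
    let scheme = if tls { "https" } else { "http" };
    Ok(RemoteLocation {
        url: format!("{scheme}://{authority}"),
        remote_table: table.to_string(),
    })
}
```

Under this rule, values such as `'yes'` or `'1'` fall back to plaintext `http` rather than producing an error, which is worth keeping in mind when writing the OPTIONS clause.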

The /api/table-config endpoint serializes the full table definition and is
reachable unauthenticated, so a remote table's inline username/password
would leak. Add TableDefinition::sensitive_keys() (default none), declare
["username", "password"] for RemoteTableDefinition, and mask those fields
in TableConfigView. Also hand-write RemoteTableDefinition's Debug so creds
can't reach logs via {:?}. Persistence (table.json) still keeps the real
values so the table can reconnect.

Test: TableConfigView masks credentials while keeping non-secret fields.
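
The redaction approach described in this commit message can be sketched as follows, assuming a simplified trait and a string map in place of the real serialized view. `RemoteDef`, `fields`, `masked_view`, and the `***` placeholder are hypothetical. Only the `sensitive_keys()` name and its default of "none" come from the text above.

```rust
use std::collections::BTreeMap;
use std::fmt;

const REDACTED: &str = "***";

/// Simplified stand-in for the PR's `TableDefinition` trait.
trait TableDefinition {
    /// Keys whose values must not be exposed; none by default.
    fn sensitive_keys(&self) -> &'static [&'static str] {
        &[]
    }
    /// Hypothetical flat view of the definition's fields.
    fn fields(&self) -> BTreeMap<String, String>;

    /// Copy of `fields` with every sensitive value replaced.
    fn masked_view(&self) -> BTreeMap<String, String> {
        let mut view = self.fields();
        for key in self.sensitive_keys() {
            if let Some(value) = view.get_mut(*key) {
                *value = REDACTED.to_string();
            }
        }
        view
    }
}

struct RemoteDef {
    url: String,
    username: Option<String>,
    password: Option<String>,
}

impl TableDefinition for RemoteDef {
    fn sensitive_keys(&self) -> &'static [&'static str] {
        &["username", "password"]
    }
    fn fields(&self) -> BTreeMap<String, String> {
        let mut map = BTreeMap::new();
        map.insert("url".to_string(), self.url.clone());
        if let Some(u) = &self.username {
            map.insert("username".to_string(), u.clone());
        }
        if let Some(p) = &self.password {
            map.insert("password".to_string(), p.clone());
        }
        map
    }
}

/// Hand-written Debug so `{:?}` never prints the credential values.
impl fmt::Debug for RemoteDef {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_struct("RemoteDef")
            .field("url", &self.url)
            .field("username", &self.username.as_ref().map(|_| REDACTED))
            .field("password", &self.password.as_ref().map(|_| REDACTED))
            .finish()
    }
}
```

The struct itself still holds the real values, which matches the note that persistence keeps them so the table can reconnect. Only the view and the Debug output are masked.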

Instead of storing username/password in table.json, remote tables now
connect to the remote Flight SQL server anonymously (no handshake, no
token). The remote must allow anonymous Flight SQL access
(BEACON_FLIGHT_SQL_ALLOW_ANONYMOUS=true), which is read-only — exactly
what federation needs. This removes the secret-at-rest entirely, so the
earlier credential-redaction machinery is no longer needed and is reverted:

- RemoteConnection/RemoteTableDefinition: drop username/password.
- actions: drop the username/password OPTIONS (keep tls).
- Revert TableDefinition::sensitive_keys() and TableConfigView redaction.
- Federation loopback test now runs against an anonymous remote with no creds.
- Docs updated: anonymous-access requirement, no credential OPTIONS.

@robinskil merged commit ef6123e into main on Jun 18, 2026
1 of 2 checks passed
@robinskil mentioned this pull request on Jun 19, 2026
2 participants