I want to merge multiple tables with the same into a single one.
Say I have a table Sales(transaction_id, date, amount) and sharded files in my file system,
sales/
- 2024/
- 12/
- 30.csv
- 31.csv
- 2025/
- 01/
- 01.csv
- 02.csv
- 03.csv
Is there a convenient way to treat sales/**/*.csv as a single table?
So far it seems that bdt query supports 2 flags for input tables,
--table path/to/single_file.csv
- A single table is read with name
single_file
--tables path/to/directory/
- Multiple tables are read, each one with it's own basename
- This imports N tables
- I don't see much value here, why not using the shell to expand something like
path/to/directory/*.csv?
I kind of want a new input file flag that expects a table name, and a set of (compatible) files,
bdt query \
--partitioned_table sales sales/**/*.csv \ # Shell will expand these globs
--sql "
select
count(*)
from
sales
"
Which would use a flag with 1+N arguments, --partitioned_table sales sales/2024/12/30.csv sales/2024/12/31.csv sales/2025/01/01.csv sales/2025/01/02.csv sales/2025/01/03.csv, and make the table sales available.
Is there a way to get this today? I tried the --tables flag, but instead got N different tables that were hard to work with as a unit.
It's not hard to create a single file that concatenates all tables, but I'd nice not needing to create it as it'd allow writing queries from the shell, with a tiny rewrite --partitioned_table sales sales/2024/12/*.csv would get me info about sales in December 2024 without any made-up disk writes.
I want to merge multiple tables with the same into a single one.
Say I have a table
Sales(transaction_id, date, amount)and sharded files in my file system,Is there a convenient way to treat
sales/**/*.csvas a single table?So far it seems that
bdt querysupports 2 flags for input tables,--table path/to/single_file.csvsingle_file--tables path/to/directory/path/to/directory/*.csv?I kind of want a new input file flag that expects a table name, and a set of (compatible) files,
Which would use a flag with 1+N arguments,
--partitioned_table sales sales/2024/12/30.csv sales/2024/12/31.csv sales/2025/01/01.csv sales/2025/01/02.csv sales/2025/01/03.csv, and make the tablesalesavailable.Is there a way to get this today? I tried the
--tablesflag, but instead got N different tables that were hard to work with as a unit.It's not hard to create a single file that concatenates all tables, but I'd nice not needing to create it as it'd allow writing queries from the shell, with a tiny rewrite
--partitioned_table sales sales/2024/12/*.csvwould get me info about sales in December 2024 without any made-up disk writes.