Skip to content

feat(filters): add TOML filters for bq query and bq show with CSV compression#896

Open
fkztw wants to merge 1 commit intortk-ai:developfrom
fkztw:feat/bq-toml-filter
Open

feat(filters): add TOML filters for bq query and bq show with CSV compression#896
fkztw wants to merge 1 commit intortk-ai:developfrom
fkztw:feat/bq-toml-filter

Conversation

@fkztw
Copy link
Copy Markdown

@fkztw fkztw commented Mar 28, 2026

Description

This PR addresses token bloat from BigQuery commands bq query and bq show by introducing TOML filters designed to mitigate noise and aggressively compress output schema and large results.

Inspired by the TOON (Token-Oriented Object Notation) project format architecture, this filter drops the inherently expensive ASCII table layouts and structural paddings typical of BigQuery CLI outputs.
By transforming the output into a more streamlined, CSV-like footprint during the proxy ingestion phase, we achieved the following optimizations:

  • Aggressive Noise Stripping: Excludes ubiquitous gcloud update warnings, BQ job submission statuses, and purely decorative ASCII borders (+---+---+).
  • TOON-Style Tabular Conversion: Dynamically evaluates standard CLI tables and replaces internal | paddings with dense comma-separated syntax. This successfully drops raw token per-line consumption by up to 40-80% depending on row width.
  • Increased Safeguard Windows: Due to significantly lower token burn per row, this modification raises the truncation limits max_lines safely from standard bounds up to 100 lines. This empowers LLM reasoning by providing 2.5x more rows of context for the same historical compute bandwidth.
  • Schema & Structured Log Resilience: Hardened tests using realistic, anonymized LTA/GA4 offline datasets confirm that multi-line JSON structures generated via REPEATED RECORDS smoothly translate effectively down into valid sparse rows.

Testing & Verification

Inline unit tests were expanded to comprehensively cover various complex schemas (JSON multi-line payload representations and massive clustered dataset partition listings). Real-world anonymized queries gathered from our local engineering team have validated our aggressive savings metric assumptions.

- Filters out noise like gcloud update warnings and job progress status
- Implements max_lines=40 and truncate_lines_at=120 to guard against large payloads
- Registers these filters in discover/rules.rs to track savings
- Adjusts test suite counts to account for the new filters
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant