Skip to content

Add post-processing SQL over semantic query results#135

Merged
nicosuave merged 5 commits intomainfrom
nicosuave/subquery-postprocess
Apr 11, 2026
Merged

Add post-processing SQL over semantic query results#135
nicosuave merged 5 commits intomainfrom
nicosuave/subquery-postprocess

Conversation

@nicosuave
Copy link
Copy Markdown
Member

Summary

  • Adds support for composing arbitrary SQL (CASE, window functions, arithmetic, COALESCE, etc.) on top of semantic query results via subquery wrapping. The inner query compiles through the semantic pipeline; the outer query is plain SQL left untouched.
  • Refactors the query rewriter to walk the AST recursively (bottom-up), enabling nested subqueries and semantic subqueries inside JOIN clauses.
  • Adds a post_process parameter to compile() and query() for the Python API path, with automatic CTE hoisting so the composed SQL is valid.
  • 15 new tests covering CASE expressions, window functions, metric arithmetic, deeply nested subqueries, JOIN subqueries, and the post_process API.

Support arbitrary SQL (CASE, window functions, arithmetic, etc.) on top
of semantic query results via subquery wrapping. The rewriter now walks
the query tree recursively so nested subqueries and JOIN subqueries that
reference semantic models are compiled correctly.

Also adds a post_process parameter to compile() and query() for the
Python API path, with automatic CTE hoisting.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 552c2d9091

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +121 to +123
has_subquery_in_joins = any(isinstance(join.this, exp.Subquery) for join in (parsed.args.get("joins") or []))

if has_ctes or has_subquery_in_from:
if has_ctes or has_subquery_in_from or has_subquery_in_joins:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve root semantic rewrite when JOIN has subquery

Routing every query with a JOIN subquery through _rewrite_with_ctes_or_subqueries skips _rewrite_simple_query for the root SELECT, because _rewrite_select_tree only rewrites child scopes. For queries whose root FROM is a semantic model (for example, FROM orders o JOIN (SELECT ...) s), metric references like o.revenue are no longer rewritten through the semantic pipeline or rejected by the explicit JOIN guard, so execution can now fail with binder errors or silently return raw row-level columns instead of semantic aggregates.

Useful? React with 👍 / 👎.

When the root FROM references a semantic model and a JOIN contains a
subquery, the routing into _rewrite_with_ctes_or_subqueries must still
apply _rewrite_simple_query to the root SELECT so the explicit JOIN
guard and semantic rewriting are not bypassed.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c3565e5730

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1865 to +1866
if self._references_semantic_model(parsed):
return self._rewrite_simple_query(parsed)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep CTE scope when rewriting root semantic queries

When a query has a root semantic FROM plus a WITH clause, this branch rewrites the whole root through _rewrite_simple_query, which regenerates SQL from extracted metrics/dimensions and drops the original CTE definitions. Any root predicate/expression that references those CTE names then fails at execution time (e.g., WITH allowed AS (...) SELECT ... FROM orders WHERE status IN (SELECT ... FROM allowed) now errors with missing table allowed). This regression is introduced by the new root rewrite path for CTE/subquery queries.

Useful? React with 👍 / 👎.

parsed_inner.set("with", None)
body = parsed_inner.sql(dialect=target_dialect)
ctes = with_clause.sql(dialect=target_dialect)
return ctes + "\n" + post_process.replace("{inner}", body)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Merge hoisted CTEs with outer post-process WITH

The post-processing path prepends hoisted inner CTEs directly in front of the user-supplied post_process SQL. If post_process itself starts with WITH, the result becomes WITH ...\nWITH ..., which is invalid SQL and fails to parse/execute. Since the feature is documented as allowing arbitrary SQL wrapping, this makes valid CTE-based post-processing unusable whenever the inner semantic query emits CTEs.

Useful? React with 👍 / 👎.

Fix two issues from PR review:

1. When the root SELECT references a semantic model and has user-defined
   CTEs, _rewrite_simple_query was discarding them. Now user CTEs are
   saved before rewriting and merged back into the generated SQL so
   filter references like IN (SELECT ... FROM user_cte) remain valid.

2. When post_process SQL has its own WITH clause and the inner semantic
   query also produces CTEs, the two WITH clauses are now merged into
   one instead of producing invalid double-WITH SQL.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 02194ec3a6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1879 to +1880
user_ctes = [cte.copy() for cte in original_with.expressions]
gen_with.set("expressions", user_ctes + list(gen_with.expressions))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve WITH RECURSIVE when merging root CTEs

When a root semantic query includes WITH RECURSIVE, this merge path copies only original_with.expressions into the generated WITH and drops the recursive flag from the original clause. The rewritten SQL becomes WITH ... instead of WITH RECURSIVE ..., so self-referencing CTEs fail at execution (e.g., DuckDB reports a circular CTE reference). I reproduced this with a root FROM orders query filtered by a recursive CTE; rewrite output omitted RECURSIVE.

Useful? React with 👍 / 👎.

Propagate the recursive flag from user-defined CTEs to the merged WITH
clause so self-referencing CTEs execute correctly.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6cc3958289

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

gen_with = rewritten.args.get("with")
if gen_with:
user_ctes = [cte.copy() for cte in original_with.expressions]
gen_with.set("expressions", user_ctes + list(gen_with.expressions))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid duplicate CTE aliases when merging root WITH clauses

When a root semantic query already has a user CTE name that matches an internally generated CTE (for example orders_cte), this merge creates duplicate aliases and the rewritten SQL fails to parse. I reproduced this with WITH orders_cte AS (...) SELECT orders.revenue FROM orders ..., which rewrote to two orders_cte entries and raised a parser error. Since generated CTE names are implementation details, users can hit this unintentionally; the merge needs collision handling (rename or namespace).

Useful? React with 👍 / 👎.

Comment on lines +573 to +574
merged = list(inner_with.expressions) + list(outer_with.expressions)
outer_with.set("expressions", merged)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Prevent CTE name collisions in post_process WITH hoisting

The post_process hoisting path prepends inner generated CTEs to outer post_process CTEs without checking alias conflicts, so a user CTE name like orders_cte collides with the semantic generator’s orders_cte and produces invalid SQL. Calling compile(..., post_process='WITH orders_cte AS (...) SELECT ... FROM ({inner})') currently yields duplicate orders_cte definitions and fails at execution time.

Useful? React with 👍 / 👎.

Two changes:

1. post_process path: remove CTE hoisting entirely. Inner SQL (with
   CTEs) is placed directly in the subquery position. CTEs inside
   subqueries are valid in all target databases and naturally scoped,
   so name collisions with post_process CTEs cannot occur.

2. Root semantic + user CTEs: detect name collisions between user CTEs
   and generated CTEs, raising a clear error instead of producing
   invalid SQL. Walk-based renaming was too aggressive (renamed user
   CTE references inside filter subqueries).
@nicosuave nicosuave merged commit 26e9294 into main Apr 11, 2026
15 checks passed
@nicosuave nicosuave deleted the nicosuave/subquery-postprocess branch April 11, 2026 15:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant