fix(oracle): optimize splitViaTableIteration to use single boundary query #914

mvanhorn wants to merge 6 commits into datazip-inc:staging
Conversation
Co-authored-by: Vaibhav <vaibhav@datazip.io>
Co-authored-by: Ankit Sharma <111491139+hash-data@users.noreply.github.com>
…uery

Replace N+1 round trips with a single NTILE-based query that fetches all chunk boundary ROWIDs at once. Falls back to the original loop approach when table stats are unavailable.

Also extracts a `buildChunksFromStartRowIDs` helper shared with `splitViaRowId` for DRY chunk construction.

Fixes datazip-inc#872
@mvanhorn Please raise your PR on top of the staging branch instead of master. You will also have to sign the CLA first, and please comment on the issue so that I can assign it to you.
Retargeted to staging. Commented on #872 for assignment. Will sign the CLA now.
- Pass approxRowCount from caller to avoid duplicate stats fetch
- Use utils.Ternary for rowsPerChunk and numberOfChunks guards
- Rename numChunks to numberOfChunks
- Add oracleNTILERowLimit (500M) threshold: fall back to the loop path on very large tables to avoid ORA-1652 temp tablespace pressure from the single-pass NTILE sort
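The guards described in this commit can be sketched as follows. `ternary` here is a local stand-in for the project's `utils.Ternary` helper, and `planChunks` is a hypothetical condensation of the real logic, not the driver's actual API:

```go
package main

import "fmt"

// oracleNTILERowLimit mirrors the 500M-row threshold from the commit: tables
// larger than this skip the single-pass NTILE sort to avoid ORA-1652
// temp-tablespace pressure.
const oracleNTILERowLimit = int64(500_000_000)

// ternary is a local stand-in for the project's utils.Ternary helper.
func ternary[T any](cond bool, a, b T) T {
	if cond {
		return a
	}
	return b
}

// planChunks applies the guards described above: floor rowsPerChunk and
// numberOfChunks at 1, and report whether the NTILE path is within the limit.
func planChunks(approxRowCount, rowsPerChunk int64) (numberOfChunks int64, useNTILE bool) {
	rowsPerChunk = ternary(rowsPerChunk > 0, rowsPerChunk, int64(1))
	numberOfChunks = ternary(approxRowCount/rowsPerChunk > 0, approxRowCount/rowsPerChunk, int64(1))
	return numberOfChunks, approxRowCount <= oracleNTILERowLimit
}

func main() {
	n, ok := planChunks(1_000_000, 50_000)
	fmt.Println(n, ok) // 20 true
	n, ok = planChunks(2_000_000_000, 50_000)
	fmt.Println(n, ok) // 40000 false
}
```

The second call shows the flagged scenario: a 2B-row table still gets a chunk plan, but the NTILE path is declined in favor of a fallback.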
@vaibhav-datazip addressed all four in 57ceb8f.

Items 1-3 (backfill.go):

Item 4 (ORA-1652 risk): Went with a row-count threshold fallback rather than rewriting the NTILE query. Added

Reasoning: the loop fallback already exists and already runs when stats are missing or the boundary query fails. Gating by row count is the smallest change that removes the 2B-row scenario you flagged without adding a new query path. Two alternatives if the threshold feels too conservative:
Open question: should the 500M cap stay hardcoded, or be wired to a config field? There is no existing Oracle driver config to hook it to, so I left it as a package constant.
Hi @mvanhorn, the TODO mainly focuses on improving the existing query by either:

Regarding the NTILE approach, it would be great if you could research alternative approaches that:

You can also take a look at the commented-out DB extents query.
…mation

For tables exceeding 500M rows, the NTILE query risks ORA-1652 and the iterative loop makes O(N) round trips. SAMPLE BLOCK reads a small percentage of data blocks (~0.1%), sorts the sampled ROWIDs, and picks evenly spaced boundaries in a single query. Falls back to the iterative loop if the sample returns too few rows.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@vaibhav-datazip researched the alternatives you asked about. Here's what I found.

Strategies evaluated

What I implemented in b74bfe6

Replaced the iterative loop fallback for tables >500M rows with

How it works: Falls back to the iterative loop if the sample returns too few rows.

About the commented-out DBA_EXTENTS approach (

The extents strategy reads physical file/block layout from

If you want to support it for users who do have DBA access, the code is already written (

Performance comparison (estimated for 2B-row table)
Current priority order in the code
Let me know if you'd like me to also uncomment and wire up the extents path behind a config flag, or if the SAMPLE BLOCK approach covers the use case.
Thanks for the detailed breakdown @mvanhorn, I'll go through the changes and add a review soon. One thing to note is a small gap in the permissions table: on a normal Oracle DB (read-write), DBMS_PARALLEL_EXECUTE can still work even if the user only has read permissions. However, in read-only environments like Oracle ADG, it fails with ORA-16000 since it performs internal metadata writes. I'll take a closer look at your changes and follow up with a detailed review soon.
Good catch on the ADG nuance @vaibhav-datazip, that belongs in the table:
So the fallback chain for a 2B-row table in an ADG replica should degrade as:

I haven't actually exercised that chain against a real ADG replica end-to-end, though. Happy to add an integration test or a note in the docs once you've had a chance to review b74bfe6.
```sql
SELECT
    bucket_id,
    MIN(row_id) AS min_row_id,
    MAX(row_id) AS max_row_id
```
You can remove `max_row_id` from this query, as it is not used in chunk generation.
```go
	return chunks, nil
}

rows, err := o.client.QueryContext(ctx, jdbc.AllChunkBoundaryRowIDsQuery(stream, numberOfChunks))
```
Won't it be better to use just `splitViaTableIterationLoop` as the fallback to the sampling query, removing the NTILE query completely?
It might happen that the table stats are not updated and NTILE runs on a 2B-row table, which would cause high memory usage in the user's DB.
```go
if samplePercent < 0.001 {
	samplePercent = 0.001
}
if samplePercent > 50 {
	samplePercent = 50
}
```
This can be simplified to:

```go
const samplePercentMin = 0.001
const samplePercentMax = 50.0
samplePercent = math.Max(samplePercentMin, math.Min(samplePercentMax, samplePercent))
```
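The suggested one-liner is behavior-preserving with respect to the two `if` statements it replaces; a quick runnable check (hypothetical wrapper function name):

```go
package main

import (
	"fmt"
	"math"
)

// clampSamplePercent wraps the suggested math.Max/math.Min form of the clamp
// so its behavior can be spot-checked against the original two-if version.
func clampSamplePercent(p float64) float64 {
	const samplePercentMin = 0.001
	const samplePercentMax = 50.0
	return math.Max(samplePercentMin, math.Min(samplePercentMax, p))
}

func main() {
	// Below the floor, in range, and above the ceiling.
	fmt.Println(clampSamplePercent(0.0001), clampSamplePercent(3.2), clampSamplePercent(99))
}
```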
```go
func (o *Oracle) splitViaTableIterationSample(ctx context.Context, stream types.StreamInterface, approxRowCount int64, numberOfChunks int64) (*types.Set[types.Chunk], error) {
	// We want at least 10x numberOfChunks sample rows for reasonable boundary accuracy.
	minSampleRows := numberOfChunks * 10
	samplePercent := float64(minSampleRows) / float64(approxRowCount) * 100.0

	// Clamp: Oracle SAMPLE requires (0.000001, 100). Floor at 0.001 for practical minimum.
```
Can you add comments explaining the basis of the numbers used here, e.g. why a multiplication factor of 10 is used for min sample rows, why the min clamp is 0.001 and the max is 50, etc.?
Description
The `splitViaTableIteration` function made N+1 database round trips to compute chunk boundaries. For large tables, this meant hundreds of sequential queries. This replaces the loop with a single NTILE-based query that fetches all chunk boundary ROWIDs at once. Falls back to the original loop when table stats are unavailable.

Also extracts `buildChunksFromStartRowIDs` as a shared helper with `splitViaRowId` to remove duplicate chunk construction logic.

Fixes #872
Type of change
How Has This Been Tested?
`drivers/oracle/...` and `pkg/jdbc/...`
N/A - backend optimization
Documentation
N/A - internal implementation change
This contribution was developed with AI assistance (Codex).