
fix(oracle): optimize splitViaTableIteration to use single boundary query #914

Open
mvanhorn wants to merge 6 commits into datazip-inc:staging from mvanhorn:osc/872-optimize-table-iteration

Conversation


@mvanhorn mvanhorn commented Apr 9, 2026

Description

The splitViaTableIteration function made N+1 database round trips to compute chunk boundaries. For large tables, this meant hundreds of sequential queries.

This replaces the loop with a single NTILE-based query that fetches all chunk boundary ROWIDs at once. Falls back to the original loop when table stats are unavailable.

Also extracts buildChunksFromStartRowIDs as a shared helper with splitViaRowId to remove duplicate chunk construction logic.
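The shared helper's job can be sketched as follows. The name `buildChunksFromStartRowIDs` comes from the PR, but this body is a plausible reconstruction, not the repository's code: the real implementation uses the project's `types.Chunk` and `types.Set`, which are not reproduced here.

```go
package main

import "fmt"

// Chunk marks a half-open ROWID range [Min, Max); Max == "" means open-ended.
type Chunk struct{ Min, Max string }

// buildChunksFromStartRowIDs turns sorted chunk-start ROWIDs into ranges:
// each chunk ends where the next one begins, and the last is open-ended.
// Both the NTILE path and splitViaRowId can feed their boundaries through
// logic of this shape, which is why the PR extracts it as a shared helper.
func buildChunksFromStartRowIDs(startRowIDs []string) []Chunk {
	chunks := make([]Chunk, 0, len(startRowIDs))
	for i, start := range startRowIDs {
		max := ""
		if i+1 < len(startRowIDs) {
			max = startRowIDs[i+1]
		}
		chunks = append(chunks, Chunk{Min: start, Max: max})
	}
	return chunks
}

func main() {
	fmt.Println(buildChunksFromStartRowIDs([]string{"AAA", "BBB", "CCC"}))
}
```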

Fixes #872

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

  • Verified Go compilation passes for both drivers/oracle/... and pkg/jdbc/...
  • Reviewed NTILE query produces correct bucket boundaries matching the original loop output
  • Fallback path preserves original behavior when stats are unavailable

Screenshots or Recordings

N/A - backend optimization

Documentation

N/A - internal implementation change


This contribution was developed with AI assistance (Codex).

vikaxsh and others added 2 commits April 8, 2026 23:16
Co-authored-by: Vaibhav <vaibhav@datazip.io>
Co-authored-by: Ankit Sharma <111491139+hash-data@users.noreply.github.com>
…uery

Replace N+1 round trips with a single NTILE-based query that fetches
all chunk boundary ROWIDs at once. Falls back to the original loop
approach when table stats are unavailable.

Also extracts buildChunksFromStartRowIDs helper shared with
splitViaRowId for DRY chunk construction.

Fixes datazip-inc#872

CLAassistant commented Apr 9, 2026

CLA assistant check
All committers have signed the CLA.

Collaborator

nayanj98 commented Apr 9, 2026

@mvanhorn Please raise your PR on top of the staging branch instead of master. You will also have to sign the CLA first, and please comment on the issue so that I can assign it to you.

@mvanhorn
Author

mvanhorn commented Apr 9, 2026

Retargeted to staging. Commented on #872 for assignment. Will sign the CLA now.

Comment thread drivers/oracle/internal/backfill.go Outdated
Comment thread drivers/oracle/internal/backfill.go Outdated
Comment thread drivers/oracle/internal/backfill.go Outdated
Comment thread pkg/jdbc/jdbc.go
- Pass approxRowCount from caller to avoid duplicate stats fetch
- Use utils.Ternary for rowsPerChunk and numberOfChunks guards
- Rename numChunks to numberOfChunks
- Add oracleNTILERowLimit (500M) threshold: fall back to loop path
  on very large tables to avoid ORA-1652 temp tablespace pressure
  from the single-pass NTILE sort
@mvanhorn
Author

@vaibhav-datazip addressed all four in 57ceb8f.

Items 1-3 (backfill.go): splitViaTableIteration now takes approxRowCount from the caller, no more duplicate NumRows fetch. rowsPerChunk and the renamed numberOfChunks both use utils.Ternary.

Item 4 (ORA-1652 risk): Went with a row-count threshold fallback rather than rewriting the NTILE query. Added oracleNTILERowLimit = 500_000_000 at package scope in backfill.go. Tables above that cap skip the NTILE path and reuse splitViaTableIterationLoop, which bounds memory per query.

Reasoning: the loop fallback already exists and already runs when stats are missing or the boundary query fails. Gating by row count is the smallest change that removes the 2B-row scenario you flagged without adding a new query path.
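The gating described above can be sketched like this. The constant name and value come from the PR; the surrounding function shape is an assumption (the real code dispatches to the split functions rather than returning a label).

```go
package main

import "fmt"

// oracleNTILERowLimit is the cap described in the PR: above it, the single
// NTILE sort risks ORA-1652 temp-tablespace exhaustion, so the driver falls
// back to the bounded-memory iterative loop.
const oracleNTILERowLimit = 500_000_000

// chooseSplitStrategy sketches the gating logic; a label stands in for the
// actual call to the corresponding split function.
func chooseSplitStrategy(approxRowCount int64, statsAvailable bool) string {
	switch {
	case !statsAvailable:
		return "loop" // no stats: keep the original iterative path
	case approxRowCount > oracleNTILERowLimit:
		return "loop" // too big for a single NTILE sort
	default:
		return "ntile"
	}
}

func main() {
	fmt.Println(chooseSplitStrategy(1_000_000, true))     // ntile
	fmt.Println(chooseSplitStrategy(2_000_000_000, true)) // loop
	fmt.Println(chooseSplitStrategy(1_000_000, false))    // loop
}
```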

Two alternatives if the threshold feels too conservative:

  1. SAMPLE(p) inside the NTILE CTE: bounded memory, but boundaries drift from uniform
  2. Coarse outer range split by extent, then NTILE per range: more uniform, but needs a new loop

Open question: should the 500M cap stay hardcoded, or wired to a config field? No existing oracle driver config to hook it to, so I left it as a package constant.

@vaibhav-datazip
Collaborator

Hi @mvanhorn,
Our primary goal here is to minimize load on the database while shifting as much computation as possible to the OLake runtime, and to enable fast full loads even for very large datasets (billions of rows).

The TODO mainly focuses on improving the existing query by either:

  • reducing the number of network calls, or
  • using a more efficient query that operates with minimal database permissions.

Regarding the NTILE approach, it would be great if you could research alternative approaches that:

  • scale reliably regardless of table size,
  • and provide measurable performance improvements.

You can also take a look at the commented-out DB extents query.
Additionally, it would be useful to evaluate and quantify the expected performance gains from any proposed approach.

…mation

For tables exceeding 500M rows, the NTILE query risks ORA-1652 and the
iterative loop makes O(N) round trips. SAMPLE BLOCK reads a small percentage
of data blocks (~0.1%), sorts the sampled ROWIDs, and picks evenly-spaced
boundaries in a single query. Falls back to the iterative loop if the sample
returns too few rows.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mvanhorn
Author

@vaibhav-datazip researched the alternatives you asked about. Here's what I found.

Strategies evaluated

| Strategy | DB Load | Scales to 1B+ | Permissions | Network Calls |
|---|---|---|---|---|
| NTILE (current, <500M) | High (full sort) | No (ORA-1652) | SELECT | 1 |
| Iterative loop (old fallback) | Medium | Yes | SELECT | O(N) |
| DBMS_PARALLEL_EXECUTE (primary) | Low (block-based) | Yes | EXECUTE on pkg | 3 |
| DBA_EXTENTS (commented out) | Very low (dict only) | Yes | DBA/SELECT_CATALOG | 2 |
| SAMPLE BLOCK (new, b74bfe6) | Low (~0.1% of blocks) | Yes | SELECT | 1 |

What I implemented in b74bfe6

Replaced the iterative loop fallback for tables >500M rows with SAMPLE BLOCK estimation (splitViaTableIterationSample in backfill.go:193).

How it works: SELECT ROWID FROM schema.table SAMPLE BLOCK(P) ORDER BY ROWID reads P% of data blocks (not rows), sorts the small sample, picks evenly-spaced boundaries. For a 2B-row table needing 1000 chunks, P is ~0.001% - reads ~20K rows instead of 2B. One query, SELECT-only permissions, no temp tablespace pressure.
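The sampling arithmetic described above can be checked with a short sketch. The 10x-per-chunk target and the 0.001/50 clamp values come from this PR's diff; the function names here are illustrative, not the repository's (the real code is `splitViaTableIterationSample`).

```go
package main

import "fmt"

// samplePercentFor reproduces the arithmetic described in the comment:
// target at least 10 sampled rows per chunk, then clamp to the PR's
// practical SAMPLE bounds of [0.001, 50] percent.
func samplePercentFor(approxRowCount, numberOfChunks int64) float64 {
	minSampleRows := numberOfChunks * 10
	p := float64(minSampleRows) / float64(approxRowCount) * 100.0
	if p < 0.001 {
		p = 0.001
	}
	if p > 50 {
		p = 50
	}
	return p
}

// evenlySpacedBoundaries picks numberOfChunks start positions out of a
// sorted sample, the way the SAMPLE BLOCK path derives chunk boundaries.
func evenlySpacedBoundaries(sorted []string, numberOfChunks int64) []string {
	n := int64(len(sorted))
	out := make([]string, 0, numberOfChunks)
	for i := int64(0); i < numberOfChunks; i++ {
		out = append(out, sorted[i*n/numberOfChunks])
	}
	return out
}

func main() {
	// 2B rows, 1000 chunks: raw percentage is 0.0005, clamped up to 0.001,
	// which is where the "~20K rows instead of 2B" figure comes from.
	fmt.Println(samplePercentFor(2_000_000_000, 1000))
	fmt.Println(evenlySpacedBoundaries([]string{"A", "B", "C", "D", "E", "F"}, 3))
}
```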

Falls back to the iterative loop if the sample returns too few rows.

About the commented-out DBA_EXTENTS approach (jdbc.go:993)

The extents strategy reads physical file/block layout from DBA_EXTENTS and converts to ROWIDs via DBMS_ROWID.ROWID_CREATE. Zero table scan, purely dictionary-based. Fastest possible option. The blocker is permissions: DBA_EXTENTS requires DBA privilege or SELECT_CATALOG_ROLE. Most OLake users connecting with minimal read-only credentials won't have this.

If you want to support it for users who do have DBA access, the code is already written (backfill.go:274-342). Could be gated behind a config flag. USER_EXTENTS works for tables owned by the connected user but not for cross-schema reads.

Performance comparison (estimated for 2B-row table)

| Strategy | Temp Space | Query Time | Round Trips |
|---|---|---|---|
| NTILE | ~20GB+ sort | Minutes (risks ORA-1652) | 1 |
| Loop (N=1000 chunks) | None | ~10-30s per iteration | 1000+ |
| SAMPLE BLOCK (0.001%) | <1MB sort | <1s | 1 |
| DBA_EXTENTS | None | <100ms (dict lookup) | 2 |

Current priority order in the code

  1. splitViaRowId (DBMS_PARALLEL_EXECUTE) - tried first
  2. splitViaTableIteration (NTILE for <500M, SAMPLE BLOCK for >500M) - fallback
  3. splitViaTableIterationLoop - last resort if both NTILE and SAMPLE fail

Let me know if you'd like me to also uncomment and wire up the extents path behind a config flag, or if the SAMPLE BLOCK approach covers the use case.

@vaibhav-datazip
Collaborator

Thanks for the detailed breakdown @mvanhorn , I’ll go through the changes and add a review soon.

One thing to note is a small gap in the permissions table: for a normal Oracle DB (read-write), DBMS_PARALLEL_EXECUTE can still work even if the user only has read permissions. However, in read-only environments like Oracle ADG, it fails with ORA-16000 since it performs internal metadata writes.
Since we rely on fallback strategies in such cases, we should ensure they work in these environments.

I’ll take a closer look at your changes and follow up with a detailed review soon.

@mvanhorn
Author

Good catch on the ADG nuance @vaibhav-datazip, that belongs in the table:

| Strategy | Works in ADG / read-only? | Why |
|---|---|---|
| DBMS_PARALLEL_EXECUTE (primary) | No | Internal metadata writes fail with ORA-16000 on standby |
| NTILE | Yes | Pure SELECT + sort in temp tablespace |
| SAMPLE BLOCK (new fallback) | Yes | Pure SELECT with sampling clause, no writes |
| Iterative loop (last resort) | Yes | Pure SELECT bounded-range queries |
| DBA_EXTENTS | Yes | Dictionary-only lookup, no writes |

So the fallback chain for a 2B-row table in an ADG replica should degrade as: splitViaRowId fails with ORA-16000 -> splitViaTableIteration handles it (NTILE for <500M rows, SAMPLE BLOCK for larger) -> splitViaTableIterationLoop if both fail. All three fallback paths are SELECT-only so none of them should hit the ORA-16000 problem.

I haven't actually exercised that chain against a real ADG replica end-to-end, though. Happy to add an integration test or a note in the docs once you've had a chance to review b74bfe6.

Comment thread pkg/jdbc/jdbc.go
SELECT
bucket_id,
MIN(row_id) AS min_row_id,
MAX(row_id) AS max_row_id
Collaborator


You can remove max_row_id from this query, as it is not used in chunk generation.

return chunks, nil
}

rows, err := o.client.QueryContext(ctx, jdbc.AllChunkBoundaryRowIDsQuery(stream, numberOfChunks))
Collaborator

@vaibhav-datazip vaibhav-datazip Apr 15, 2026


Wouldn't it be better to use just splitViaTableIterationLoop as the fallback to the sampling query, removing the NTILE query completely?
It might happen that the table stats are not updated and NTILE runs against a 2B-row table, which would cause high memory usage in the user's database.

Comment on lines +207 to +212
if samplePercent < 0.001 {
samplePercent = 0.001
}
if samplePercent > 50 {
samplePercent = 50
}
Collaborator


This can be simplified:

Suggested change
if samplePercent < 0.001 {
samplePercent = 0.001
}
if samplePercent > 50 {
samplePercent = 50
}
const samplePercentMin = 0.001
const samplePercentMax = 50.0
samplePercent = math.Max(samplePercentMin, math.Min(samplePercentMax, samplePercent))

Comment on lines +201 to +206
func (o *Oracle) splitViaTableIterationSample(ctx context.Context, stream types.StreamInterface, approxRowCount int64, numberOfChunks int64) (*types.Set[types.Chunk], error) {
// We want at least 10x numberOfChunks sample rows for reasonable boundary accuracy.
minSampleRows := numberOfChunks * 10
samplePercent := float64(minSampleRows) / float64(approxRowCount) * 100.0

// Clamp: Oracle SAMPLE requires (0.000001, 100). Floor at 0.001 for practical minimum.
Collaborator


Can you add comments explaining the basis of the numbers used here, e.g. why a multiplication factor of 10 is used for min sample rows, and why the min clamp is 0.001 and the max is 50?


Development

Successfully merging this pull request may close these issues.

Make this function more efficient by using a single query to get all the rowids in the table using one query

5 participants