Skip to content

[Just for sharing] Local Spark-driven repro of concurrent-insert snapshot loss#614

Draft
dushyantk1509 wants to merge 1 commit into
linkedin:mainfrom
dushyantk1509:dushyantk1509/local-repro-concurrent-insert-test
Draft

[Just for sharing] Local Spark-driven repro of concurrent-insert snapshot loss#614
dushyantk1509 wants to merge 1 commit into
linkedin:mainfrom
dushyantk1509:dushyantk1509/local-repro-concurrent-insert-test

Conversation

@dushyantk1509
Copy link
Copy Markdown
Contributor

Mirrors TablesConcurrencyTest.concurrentInsertTest from the LinkedIn acceptance suite, but runs fully in-process against the embedded OpenHouse server via OpenHouseSparkITest. Two futures fire 5 INSERT INTO statements each against a fresh table; each returns HTTP 200, so the client-side committedRecord stays at 10. The final SELECT row count must equal committedRecord — any mismatch means a snapshot was silently dropped by the catalog's subtractive-merge commit path during the race.

Reproduces on first run today (rows=9, expected=10), matching the production data-loss signature observed on WAR (BaseTransaction:466 WARN, "Failed to load metadata for a committed snapshot, skipping clean-up") and the chronic failures of TablesConcurrencyTest on WAR DR.

Summary

Issue] Briefly discuss the summary of the changes made in this
pull request in 2-3 lines.

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

Mirrors TablesConcurrencyTest.concurrentInsertTest from the LinkedIn
acceptance suite, but runs fully in-process against the embedded
OpenHouse server via OpenHouseSparkITest. Two futures fire 5 INSERT
INTO statements each against a fresh table; each returns HTTP 200, so
the client-side committedRecord stays at 10. The final SELECT row count
must equal committedRecord — any mismatch means a snapshot was silently
dropped by the catalog's subtractive-merge commit path during the race.

Reproduces on first run today (rows=9, expected=10), matching the
production data-loss signature observed on WAR (BaseTransaction:466
WARN, "Failed to load metadata for a committed snapshot, skipping
clean-up") and the chronic failures of TablesConcurrencyTest on WAR DR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
mkuchenbecker added a commit to mkuchenbecker/openhouse that referenced this pull request May 28, 2026
Replace the hand-crafted constructed-metadata unit tests with PR linkedin#614's
SparkConcurrentInsertFunctionalTest as the canonical regression target.
The test drives concurrent INSERT statements through the embedded
OpenHouseSparkITest server and asserts row_count == reported_success.

Honest validation against the canonical test:
  linkedin/main (pre-fix): FAIL — rows=9, expected=10
  this branch (with fix):  FAIL — rows=9, expected=10

The abortIfWriterBaseDivergedFromCatalog check is invoked 11 times during
the test run (one per insert) and never throws — in every invocation,
writerClaimedBase (COMMIT_KEY) matches base.metadataFileLocation() modulo
URI scheme. The race fires below the doCommit layer: both racing server
threads have already loaded the same cached `base` view by the time their
doCommits run, so the COMMIT_KEY-vs-base comparison can't detect the
race. HTS-layer CAS catches one writer (returns 409) but one row is
still permanently dropped — likely in the loser's client-side retry path
or in a second-tier race.

The fix as written addresses the BaseTransaction.applyUpdates silent-
rebase variant of the bug class (the production fingerprint observed on
WAR DR on 2026-05-25). It does not address the variant PR linkedin#614 exposes,
where both threads' applyUpdates' refresh-and-rebase sees the same v0
and both proceed to a metadata write at the same nominal base. That
variant requires a different defense — likely at HouseTableRepositoryImpl
.save() or at the metadata file pointer-CAS step.

This commit lands what we have so the PR carries both the canonical
behavior test and the partial fix; the racing-CREATE / writer-claims-
INITIAL paths added in the encapsulated helper are still useful guards
for other code paths but do not close PR linkedin#614's window.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mkuchenbecker added a commit to mkuchenbecker/openhouse that referenced this pull request May 28, 2026
This file used TableMetadataParser to write fake metadata.json bytes to
disk and called doCommit directly with constructed metadata — the
"direct file manipulation bs" the user explicitly rejected. PR linkedin#614's
SparkConcurrentInsertFunctionalTest is the canonical behavior repro;
delete this in favor of that.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mkuchenbecker added a commit to mkuchenbecker/openhouse that referenced this pull request May 28, 2026
PR linkedin#614's SparkConcurrentInsertFunctionalTest passes 4/4 with this change.

Root cause identified via end-to-end instrumentation:
- Two racing INSERTs reach houseTableRepository.save() with the same
  cached `tableVersion` (HTS-level CAS input).
- Both saves succeed at the JPA layer because HouseTable had no @Version
  field, so Spring Data JPA's em.merge() did not emit a versioned UPDATE.
- Result: last writer overwrites HTS pointer, loser's metadata.json is
  orphaned, loser's data file is unreachable. 10 inserts report success
  but only 9 rows visible.

Adding @Version makes em.merge produce
  UPDATE house_table SET ..., version = ? WHERE id = ? AND version = ?
which the H2 backend's MVCC enforces. The second writer's UPDATE matches
0 rows -> ObjectOptimisticLockingFailureException -> wrapped as 409 ->
Iceberg client retries with fresh base -> data lands.

Validated:
  Before:  test fails (rows=9, expected=10) - reliable repro
  After:   test passes 4 consecutive runs

This complements PR linkedin#611's catalog-level abortIfWriterBaseDivergedFromCatalog
check (the BaseTransaction.applyUpdates silent-rebase variant of the bug
class). Together they close two distinct races in the same bug class:
- Silent-rebase variant: covered by the doCommit-level CAS
- HTS-save variant: covered by this @Version field

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mkuchenbecker added a commit to mkuchenbecker/openhouse that referenced this pull request May 28, 2026
@mkuchenbecker
Copy link
Copy Markdown
Collaborator

Why this test isn't useful for confirming a prod bug

The test's plumbing

OpenHouseSparkITest boots SpringH2TestApplication which does NOT scan com.linkedin.openhouse.housetables.* — the real HTS service is absent. In its place,
HouseTablesH2Repository (in tables-test-fixtures, @primary) extends HouseTableRepository directly and delegates to JPA. So houseTableRepository.save() in the test goes:

doCommit → HouseTablesH2Repository.save() → JPA.merge() on H2 → done

Production goes:

doCommit → HouseTableRepositoryImpl.save() → HTTP PUT → HTS service →
UserTablesServiceImpl.putUserTable() → UserTableVersionMapper.toVersion() (string-CAS on metadataLocation) →
UserTableRow JPA.merge() on MySQL (with @Version optimistic lock) → done

Two completely different code paths past houseTableRepository.save(). The test skips the HTS service, the version mapper, the HTTP boundary, the real DB.

What the race in the test actually proves

Two threads both reach HouseTablesH2Repository.save() at the same (tableId, databaseId). JPA merges both rows. No CAS, last writer wins. The first writer's metadata.json is
orphaned. That's the 8-of-10 you see.

This proves: the H2 fixture is missing the CAS that production's HTS service has. It does not prove that the production path has a race — because the production path has
UserTableVersionMapper.toVersion() plus @Version on UserTableRow, both of which are in the call chain only when the real HTS service runs.

mkuchenbecker added a commit to mkuchenbecker/openhouse that referenced this pull request May 28, 2026
…remove dead @Version code

PR linkedin#614's test reproduces a race against the H2 test fixture (HouseTablesH2Repository, @primary in tests). The fixture short-circuits the production HTS service: no UserTableVersionMapper, no UserTableRow @Version. The race it catches doesn't exist on the prod MySQL+HTS path. Removing.

Restore testDoCommitAbortsOnStaleClaimedBase from PR linkedin#611 (production-realistic shape, writes a real metadata.json on disk so the URI-normalized path comparison runs).

Remove the commented-out @Version block on HouseTable; that path was the wrong layer.
mkuchenbecker added a commit that referenced this pull request May 29, 2026
## Summary

Adds a catalog-level CAS in `OpenHouseInternalTableOperations.doCommit`
that aborts when the writer's declared base (`COMMIT_KEY`) does not
match the catalog's current persisted base — closing the
`BaseTransaction.applyUpdates` silent-rebase variant of the stale-base
lost-update bug (incident-12185).

## Mechanism

1. A writer's transaction stages `COMMIT_KEY = T_X` and a
`SNAPSHOTS_JSON_KEY` payload computed against `T_X`.
2. A racing commit advances the catalog `T_X → T_Y`, adding a snapshot.
3. `BaseTransaction.applyUpdates` silently refreshes the in-flight base
to `T_Y` and re-applies the staged update — re-stamping `COMMIT_KEY =
T_X` on top of `T_Y` while leaving the stale `SNAPSHOTS_JSON_KEY`.
4. `doCommit` runs with `base = T_Y` but `COMMIT_KEY = T_X`. Without the
check, the subtractive snapshot merge computes `toRemove =
T_Y.snapshots() − stale payload = {racing snapshot}` and silently drops
it.

The fix reads `COMMIT_KEY` before `failIfRetryUpdate` strips it,
URI-normalizes both paths via Hadoop `Path`, and throws
`CommitFailedException` on mismatch so Iceberg retries against the fresh
base.

## Scope of the check

- **Aborts** when `COMMIT_KEY` is a concrete location differing from the
catalog base, or is `INITIAL_VERSION`.
- **Does not defend** commits that leave `COMMIT_KEY` unset — wholesale
replace/create (`replaceTable`, stage-create, stage-replace) are
authoritative over the snapshot set, so there is no stale base to
compare against.

## Changes

Bug fix + unit test, internal catalog only (2 files). `doCommit` may now
throw `CommitFailedException` on a stale-base commit that previously
silently dropped a racing snapshot.

## Testing

Unit test `testDoCommitMustAbortStaleBaseRebaseToPreventSnapshotLoss` in
`OpenHouseInternalTableOperationsTest` round-trips the post-refresh
`TableMetadata` through `TableMetadataParser` so
`base.metadataFileLocation()` is non-null (matching a loaded-from-disk
base) and the URI-normalized comparison runs. Asserts
`CommitFailedException` is thrown and `houseTableRepository.save` is
never invoked.

```
./gradlew :iceberg:openhouse:internalcatalog:test --tests "*OpenHouseInternalTableOperationsTest"
```

### Not covered

A Spark concurrent-insert behavior test
(`SparkConcurrentInsertFunctionalTest`, PR #614) was explored and
removed: it only reproduces against the H2 test fixture, not production
MySQL+HTS. A prod-realistic black-box repro would need the real HTS app
or a deployed instance.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: mkuchenbecker <mkuchenbecker@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants