Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: be78f15ba7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```diff
     # Upsert on unique logical key (chemistry_sample_info_id, analyte)
     stmt = insert_stmt.values(chunk).on_conflict_do_update(
-        index_elements=["nma_GlobalID"],
+        index_elements=["chemistry_sample_info_id", "analyte"],
```
Handle `nma_GlobalID` conflicts in this upsert
Switching ON CONFLICT to `("chemistry_sample_info_id", "analyte")` leaves collisions on the still-unique `nma_GlobalID` unhandled, so reruns can fail with a duplicate-key error instead of updating rows. This is reachable because `_row_to_dict` always supplies `nma_GlobalID` but allows `analyte` to be `None` (which will not match the composite unique constraint), and the model still declares `nma_GlobalID` as unique (`db/nma_legacy.py`). As a result, idempotent transfers break when a record has a null `analyte`, or when `analyte` changes for an existing GlobalID.
Pull request overview
Updates the Minor/Trace Chemistry transfer to align its deduplication and Postgres upsert conflict target with the table's logical uniqueness constraint `(chemistry_sample_info_id, analyte)` rather than the legacy `nma_GlobalID`.
Changes:
- Switches the `_transfer_hook` upsert conflict target from `nma_GlobalID` to `(chemistry_sample_info_id, analyte)`.
- Updates row deduplication to use the same logical key as the table's unique constraint.
- Adjusts inline documentation/comments to reflect the new upsert key.
```python
# Dedupe by the same logical key used by the table unique constraint.
rows = self._dedupe_rows(
    row_dicts,
    key=["chemistry_sample_info_id", "analyte"],
    include_missing=True,
)
```
`analyte` is nullable on `NMA_MinorTraceChemistry`, but the new dedupe/upsert key includes it, and `include_missing=True` keeps rows where `analyte` is NULL. In Postgres, `ON CONFLICT (chemistry_sample_info_id, analyte)` will not match existing rows when `analyte` is NULL, so re-running the transfer will keep inserting additional NULL-analyte rows (non-idempotent growth). Consider treating a missing `analyte` as an error/skip (capture it in `_row_to_dict`), using a different conflict key or fallback path (e.g., upsert on `nma_GlobalID` when `analyte` is missing), or enforcing `analyte` as NOT NULL in the schema if that is valid for the dataset.
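The NULL-growth behavior is easy to reproduce outside Postgres, since SQLite shares the relevant semantics (NULLs are distinct in unique constraints, so a conflict target containing a NULL never fires). A self-contained demonstration with illustrative table and column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chem (sid INTEGER, analyte TEXT, value REAL,"
    " UNIQUE (sid, analyte))"
)
upsert = (
    "INSERT INTO chem (sid, analyte, value) VALUES (?, ?, ?)"
    " ON CONFLICT (sid, analyte) DO UPDATE SET value = excluded.value"
)
# Re-running with a non-null analyte updates the row in place...
conn.execute(upsert, (1, "As", 1.0))
conn.execute(upsert, (1, "As", 2.0))
# ...but a NULL analyte never matches the conflict target, so every rerun
# inserts a fresh row instead of updating.
conn.execute(upsert, (2, None, 1.0))
conn.execute(upsert, (2, None, 1.0))
rows = conn.execute("SELECT COUNT(*) FROM chem WHERE sid = 2").fetchone()[0]
print(rows)  # → 2: non-idempotent growth for NULL-analyte rows
```

Postgres behaves the same way unless the unique index is declared `NULLS NOT DISTINCT` (available from Postgres 15).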
```diff
     # Upsert on unique logical key (chemistry_sample_info_id, analyte)
     stmt = insert_stmt.values(chunk).on_conflict_do_update(
-        index_elements=["nma_GlobalID"],
+        index_elements=["chemistry_sample_info_id", "analyte"],
```
Because the conflict target is now `(chemistry_sample_info_id, analyte)`, any insert row that violates another unique constraint (notably the `nma_GlobalID` unique constraint) will raise an error rather than being handled by the upsert. If the source extract can contain duplicate GlobalID values (even transiently), the transfer will fail. Consider either pre-deduping on `nma_GlobalID` as well, or adding a defensive validation that `nma_GlobalID` is unique within `row_dicts` before executing the batches.
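The defensive validation could be as small as a counter over the extract; a sketch, assuming rows are dicts carrying an `nma_GlobalID` key (the helper name is illustrative):

```python
from collections import Counter

def assert_globalid_unique(row_dicts):
    """Fail fast if the extract contains duplicate nma_GlobalID values,
    since the composite-key upsert no longer absorbs those collisions."""
    counts = Counter(r["nma_GlobalID"] for r in row_dicts)
    dupes = sorted(gid for gid, n in counts.items() if n > 1)
    if dupes:
        raise ValueError(f"duplicate nma_GlobalID values in extract: {dupes}")
```

Running this before the batch loop turns a mid-transfer `IntegrityError` into a clear up-front failure.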
```diff
 for i in range(0, len(rows), self.batch_size):
     chunk = rows[i : i + self.batch_size]
     logger.info(f"Upserting batch {i}-{i+len(chunk)-1} ({len(chunk)} rows)")
-    # Upsert on nma_GlobalID (legacy UUID PK, now UNIQUE)
+    # Upsert on unique logical key (chemistry_sample_info_id, analyte)
     stmt = insert_stmt.values(chunk).on_conflict_do_update(
-        index_elements=["nma_GlobalID"],
+        index_elements=["chemistry_sample_info_id", "analyte"],
```
This change modifies the idempotency key for upserts, but there is no test coverage asserting that re-running the transfer updates existing `(chemistry_sample_info_id, analyte)` rows rather than inserting duplicates (especially around NULL analytes). Consider adding a unit/integration test that runs `_transfer_hook` twice against a test DB and asserts that row counts remain stable and values update as expected.
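A sketch of such a test, using an in-memory SQLite table as a stand-in for the Postgres target (all names are illustrative, not the project's actual helpers):

```python
import sqlite3

def run_transfer(conn, rows):
    """Stand-in for the transfer's batch upsert loop."""
    conn.executemany(
        "INSERT INTO chem (sid, analyte, value) VALUES (:sid, :analyte, :value)"
        " ON CONFLICT (sid, analyte) DO UPDATE SET value = excluded.value",
        rows,
    )

def test_transfer_is_idempotent():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE chem (sid INTEGER, analyte TEXT NOT NULL, value REAL,"
        " UNIQUE (sid, analyte))"
    )
    rows = [{"sid": 1, "analyte": "As", "value": 1.0},
            {"sid": 1, "analyte": "Pb", "value": 2.0}]
    run_transfer(conn, rows)
    rows[0]["value"] = 9.9      # simulate a changed source value
    run_transfer(conn, rows)    # second run must update, not duplicate
    assert conn.execute("SELECT COUNT(*) FROM chem").fetchone()[0] == 2
    assert conn.execute(
        "SELECT value FROM chem WHERE analyte = 'As'"
    ).fetchone()[0] == 9.9

test_transfer_is_idempotent()
```

An integration variant against a real Postgres instance would additionally exercise the NULL-`analyte` path, which this NOT NULL stand-in deliberately excludes.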