Conversation
…fer input validation
There was a problem hiding this comment.
Pull request overview
This PR implements extensive improvements to data transfer verification and performance optimizations for the BDMS-576 project. The PR focuses on verifying successful data migration from SQL Server to PostgreSQL and includes significant performance enhancements to transfer scripts, along with schema relaxations to accommodate legacy data quality issues.
Changes:
- Added comprehensive transfer verification framework with
TransferResulttypes, comparison specs, and a builder for generating transfer summary reports - Implemented batch insert optimizations across multiple transfer modules (thing_transfer, geologic_formation, link_ids, contact_transfer) to improve performance
- Relaxed database constraints (measuring_point_height, deployment.installation_date, well_screen depths, address fields) to nullable to accommodate incomplete legacy data
- Enhanced data quality handling with email/phone field detection, measuring point height rounding, and improved contact deduplication
Reviewed changes
Copilot reviewed 44 out of 44 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| transfers/well_transfer.py | Added progress tracking with transferred_count for parallel wells transfer |
| transfers/waterlevels_transfer.py | Changed logging levels, renamed stats field to rows_missing_participants, continued processing without field_event_participants |
| transfers/util.py | Added _round_sig_figs function to round estimated measuring point heights to 2 significant figures |
| transfers/transferer.py | Removed error logging for missing sample_pt_id matches |
| transfers/transfer_results_types.py | New file defining TransferResult dataclasses for verification |
| transfers/transfer_results_specs.py | New file with 29 transfer comparison specs and filter functions |
| transfers/transfer_results_builder.py | New file implementing TransferResultsBuilder for comparing source CSV to destination DB |
| transfers/transfer.py | Made bucket upload conditional on SAVE_TO_BUCKET env variable |
| transfers/thing_transfer.py | Refactored to batch insert pattern with bulk Location/Thing/Notes/DataProvenance inserts |
| transfers/surface_water_data.py | Removed null thing_id checks, allowing nullable thing_id |
| transfers/sensor_transfer.py | Changed installation_date logging from critical to warning, allows NULL |
| transfers/relaxed_constraints.md | New documentation file listing 10 relaxed constraints |
| transfers/logger.py | Added OCO_LOG_CONTEXT env support for cli vs transfer log segregation |
| transfers/link_ids_transfer.py | Refactored to batch insert pattern with bulk ThingIdLink inserts |
| transfers/geologic_formation_transfer.py | Refactored to bulk upsert with on_conflict_do_nothing |
| transfers/contact_transfer.py | Added contact caching, phone-in-email detection, email normalization |
| transfers/README.md | New README documenting transfer structure and performance rules |
| tests/unit/test_contact_transfer_email_utils.py | New tests for email normalization and phone detection |
| tests/test_util.py | Added tests for measuring point height rounding |
| tests/test_cli_commands.py | Added test for transfer-results CLI command |
| tests/features/environment.py | Added DROP_AND_REBUILD_DB env guard for test setup/teardown |
| tests/README.md | New README documenting test structure |
| schemas/thing.py | Made measuring_point_height nullable, disabled depth validations for transfer |
| schemas/sample.py | Made field_event_participant_id nullable |
| schemas/deployment.py | Made installation_date nullable |
| schemas/contact.py | Made address city, state, postal_code nullable |
| pyproject.toml | Added transfers to packages list |
| db/thing.py | Made screen_depth_top and screen_depth_bottom nullable |
| db/measuring_point_history.py | Made measuring_point_height nullable |
| db/deployment.py | Made installation_date nullable |
| db/contact.py | Made address city, state, postal_code nullable |
| db/README.md | New README documenting DB structure |
| core/lexicon.json | Added well_construction_method to Unknown term categories |
| cli/cli.py | Added transfer-results and compare-duplicated-welldata commands |
| cli/README.md | New README documenting CLI structure |
| api/README.md | New README documenting API structure |
| alembic/versions/*.py | Added 5 new migrations for nullable changes, deleted 1 merge migration, updated down_revision for 1 migration |
alembic/versions/50d1c2a3b4c5_add_unique_index_ngwmn_wellconstruction.py
Show resolved
Hide resolved
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 41ff8de1ee
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…pdating row processing
…ointHistory and remove depth validation
| return mphs, mph_descs, start_dates, end_dates | ||
|
|
||
|
|
||
| def _round_sig_figs(value: float, sig_figs: int) -> float: |
There was a problem hiding this comment.
The _round_sig_figs function has a parameter type annotation of value: float but handles None, NaN, non-numeric values, and various edge cases. The type annotation should be value: float | None or value: Any to accurately reflect the accepted input types.
alembic/versions/a1b2c3d4e5f7_make_deployment_installation_date_nullable.py
Show resolved
Hide resolved
| Address.postal_code is nullable | ||
| Thing measuring_point_height is nullable | ||
| ValidateWell, depth validation removed | ||
| Deployment.installation_date is nullable | ||
| CreateWellScreen depth validation removed | ||
| FieldEventParticipants not required | ||
| screen_depth_bottom is nullable | ||
| screen_depth_top is nullable | ||
| city nullable | ||
| state nullable No newline at end of file |
There was a problem hiding this comment.
FYI @ksmuczynski. All of these constraints should be reenabled after the NMA migration is completed
|
I agree, in the interest of transferring legacy data successfully and in its entirety, allowing |
marissafichera
left a comment
There was a problem hiding this comment.
Looks bueno. Thanks!
…issing value checks
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 47 out of 48 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
schemas/thing.py:57
ValidateWell.validate_values()now returns early, leaving the depth-validation logic below permanently unreachable. If the intent is a temporary transfer-only relaxation, consider guarding the validation behind a feature flag/env var (or a schema option) and removing the dead code path to avoid confusion and accidental reintroduction issues later.
@model_validator(mode="after")
def validate_values(self):
# todo: reenable depth validation. removed for transfer
return self
if self.hole_depth is not None:
if self.well_depth is not None and self.well_depth > self.hole_depth:
raise ValueError(
"well depth must be less than than or equal to hole depth"
)
| if contacts_to_create: | ||
| try: | ||
| created_contact_ids = ( | ||
| session.execute( | ||
| insert(Contact).returning(Contact.id), | ||
| contacts_to_create, | ||
| ) | ||
|
|
||
| self._created_contacts[(name, organization)] = contact | ||
| .scalars() | ||
| .all() | ||
| ) | ||
| except Exception as e: | ||
| logger.critical( | ||
| "Contact insert failed for PointID=%s, GlobalID=%s: %s", | ||
| row.PointID, | ||
| row.GlobalID, | ||
| str(e), | ||
| ) | ||
| else: |
There was a problem hiding this comment.
In _get_field_event_participant_ids, if the bulk insert(Contact) fails, the exception is logged but the session is not rolled back. After a DB/SQLAlchemy execution error, the session is typically left in a failed transaction state and later inserts in the same group will raise PendingRollbackError. Roll back (and ideally capture an error) in the except branch, and consider returning an empty participant list for that row/group so downstream inserts can continue safely.
| def _transfer_hook(self, session: Session): | ||
| self._build_contact_caches(session) | ||
|
|
||
| groups = self._get_group() | ||
| pointids = [ | ||
| idx[0] if isinstance(idx, tuple) else idx for idx in groups.groups.keys() | ||
| ] | ||
| things = session.query(Thing).filter(Thing.name.in_(pointids)).all() | ||
| thing_by_name = {thing.name: thing for thing in things} | ||
| logger.info( |
There was a problem hiding this comment.
_build_contact_caches() stores live Contact ORM objects for reuse, but _group_step can still session.rollback() on exceptions (later in the flow). A rollback can discard newly-created (uncommitted) contacts and leave these cached ORM instances out of sync with what’s actually persisted, which can break dedupe/association logic in subsequent iterations. To keep caches consistent, consider isolating per-row/per-group work in a nested transaction, or clearing/rebuilding the caches after any rollback.
Use this updated PR description (adds the recent work):
Summary
This PR delivers a broad transfer-quality and operability update against
staging, including:WellData+ well smoke test)What Changed
1. Transfer results / comparison framework
Added a full comparison/reporting subsystem to evaluate source-vs-destination transfer outcomes:
transfers/transfer_results_builder.py(new)transfers/transfer_results_specs.py(new)transfers/transfer_results_types.py(new)cli/cli.py:transfer-resultscommandThis produces summary artifacts for transfer verification and agreed/missing/extra analysis.
2. New CLI diagnostics
Enhanced
cli/cli.pywith first-class commands:transfer-resultscompare-duplicated-welldata(new)WellData.PointIDrowsPointID)--pointidfilterswell-smoke-test(new, first-class)--all-wells/--sampledmode to run entire selected population (e.g. all agreed wells)transfers/metrics3. Transfer logging split
Separated CLI logs from transfer-script logs:
transfers/logger.pysupports context-aware log routing viaOCO_LOG_CONTEXTcli/cli.pysetsOCO_LOG_CONTEXT=cliResult:
cli/logs/cli_*.logtransfers/logs/transfer_*.logcli_logsvstransfer_logs)4. Nullable constraints + migration chain updates
Schema/migration updates to support real-world nulls seen during transfer:
New migrations:
8c9d0e1f2a3b_make_measuring_point_height_nullable.py9a0b1c2d3e4f_make_address_postal_code_nullable.pya1b2c3d4e5f7_make_deployment_installation_date_nullable.pyb3c4d5e6f7a8_make_wellscreen_depths_nullable.pyc4d5e6f7a8b9_make_address_city_state_nullable.pyAlso:
43bc34504ee6_merge_migrations_after_staging_merge.py50d1c2a3b4c5_add_unique_index_ngwmn_wellconstruction.pyModel/schema alignment:
db/contact.py,schemas/contact.py(city/state/postal_code nullable handling)db/deployment.py,schemas/deployment.pydb/measuring_point_history.py,schemas/thing.py,db/thing.py5. Contact transfer robustness
transfers/contact_transfer.pyupdates include:Email:prefixuser@x.com.)ThingContactAssociationrows on rerunsAdded unit test:
tests/unit/test_contact_transfer_email_utils.py6. Transfer performance/hardening updates
Improvements across transfer modules for throughput and reliability, including updates in:
transfers/geologic_formation_transfer.pytransfers/link_ids_transfer.pytransfers/thing_transfer.pytransfers/sensor_transfer.pytransfers/surface_water_data.pytransfers/waterlevels_transfer.pytransfers/well_transfer.pytransfers/util.pytransfers/transfer.pytransfers/transferer.py(Shift toward batch/core insert paths and reduced per-row overhead where applicable.)
7. Documentation additions
Added directory READMEs:
api/README.mdcli/README.mddb/README.mdtransfers/README.mdtests/README.mdtransfers/relaxed_constraints.md8. Smoke-test parity fixes (recent)
transfers/smoke_test.pywas updated to match transfer behavior and remove false positives:IncompleteNMAPhone)Validation / Verification Performed
transfer-resultscompare-duplicated-welldatawell-smoke-test(including--all-wells)WellDatacomparison outputsampled_wells=9887presence_mismatches=0value_mismatches=0failed_wells=0Note: full pytest suite is environment-dependent (DB-backed) and was not fully executable in sandboxed context.
Migration Notes
Before running transfers with this branch:
source .venv/bin/activate alembic upgrade headRisk / Impact
Suggested Reviewer Focus
transfers/transfer_results_specs.pyagreed/missing semanticstransfers/contact_transfer.pydedupe + idempotent enrichment logic for reused contactswell-smoke-testparity assumptions intransfers/smoke_test.pyWellDatacomparison command output format and utility