fix(transfers): normalize OwnerKey joins with mapper and collision guard #482
Conversation
- add owners_ownerkey_mapper.json for canonical OwnerKey mapping
- apply canonicalization + casefold normalization before OwnerLink join
- fail fast on normalization collisions with actionable logging
- document the mapping file in README
Pull request overview
Adds case-insensitive, collision-guarded OwnerKey joining for contact transfers by introducing an optional canonicalization mapper and normalizing keys prior to joining.
Changes:
- Introduced `owners_ownerkey_mapper.json` and a loader hook in `ContactTransfer.__init__`.
- Normalized `OwnerKey` values with canonicalization + `strip().casefold()` and added a collision guard.
- Updated the join to use `ownerkey_norm` and documented mapper usage in the README.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| transfers/data/owners_ownerkey_mapper.json | Adds a canonicalization mapping to resolve known OwnerKey case/spelling variants. |
| transfers/contact_transfer.py | Loads mapper, canonicalizes + normalizes OwnerKeys, checks collisions, and joins on normalized key. |
| README.md | Documents remediation steps when collisions are detected. |
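Pulling the pieces quoted in this review together, the canonicalize → normalize → guard → join pipeline can be sketched end to end. The frame contents, mapper entries, and the exact collision-guard shape below are illustrative assumptions, not the PR's actual code:

```python
import pandas as pd

# Hypothetical sample frames; the real ones are the owners and owner-link extracts.
odf = pd.DataFrame({"OwnerKey": ["Acme Corp", "ACME CORP ", "Beta LLC"]})
ldf = pd.DataFrame({"ownerkey": ["acme corp", "beta llc"], "contact_id": [101, 202]})

# Same shape the mapper file is described to have: known variant -> canonical spelling.
mapper = {"ACME CORP ": "Acme Corp"}

def canonicalize(s: pd.Series) -> pd.Series:
    # Vectorized dict lookup; unmapped keys fall back to the original value.
    return s.map(mapper).fillna(s)

def normalize(s: pd.Series) -> pd.Series:
    # strip + casefold, with empty strings treated as missing keys.
    return s.astype(str).str.strip().str.casefold().replace({"": pd.NA})

odf["ownerkey_canonical"] = canonicalize(odf["OwnerKey"])
odf["ownerkey_norm"] = normalize(odf["ownerkey_canonical"])
ldf["ownerkey_canonical"] = canonicalize(ldf["ownerkey"])
ldf["ownerkey_norm"] = normalize(ldf["ownerkey_canonical"])

# Collision guard: two distinct canonical keys must not normalize to one value.
per_norm = odf.groupby("ownerkey_norm")["ownerkey_canonical"].nunique()
collisions = per_norm[per_norm > 1]
if not collisions.empty:
    raise ValueError(f"OwnerKey normalization collisions: {list(collisions.index)}")

merged = odf.merge(ldf[["ownerkey_norm", "contact_id"]], on="ownerkey_norm", how="left")
```

Note how the mapper resolves the `"ACME CORP "` variant before casefolding, so both owner rows join to the same link row without tripping the guard.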
transfers/contact_transfer.py
Outdated
```python
odf["ownerkey_canonical"] = odf[owner_key_col].map(
    lambda v: self._ownerkey_mapper.get(v, v)
)
ldf["ownerkey_canonical"] = ldf[link_owner_key_col].map(
    lambda v: self._ownerkey_mapper.get(v, v)
)
```
Using .map(lambda ...) applies a Python callback per row, which is slower on large DataFrames. Prefer a vectorized approach such as mapping with the dict directly and filling missing values (e.g., s.map(mapper).fillna(s)), applied to both series.
Suggested change:

```diff
-odf["ownerkey_canonical"] = odf[owner_key_col].map(
-    lambda v: self._ownerkey_mapper.get(v, v)
-)
-ldf["ownerkey_canonical"] = ldf[link_owner_key_col].map(
-    lambda v: self._ownerkey_mapper.get(v, v)
-)
+mapped_odf = odf[owner_key_col].map(self._ownerkey_mapper)
+odf["ownerkey_canonical"] = mapped_odf.fillna(odf[owner_key_col])
+mapped_ldf = ldf[link_owner_key_col].map(self._ownerkey_mapper)
+ldf["ownerkey_canonical"] = mapped_ldf.fillna(ldf[link_owner_key_col])
```
- Drop overlapping OwnerLink columns before joining on normalized OwnerKey to prevent “columns overlap” errors during contact transfer.
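The "columns overlap" error this commit guards against comes from pandas' `DataFrame.join`, which raises when both frames carry same-named non-key columns and no suffix is specified. A minimal sketch of the drop-before-join approach, with invented column names:

```python
import pandas as pd

# Placeholder frames indexed on the normalized key; both carry a "Name" column.
odf = pd.DataFrame(
    {"ownerkey_norm": ["a", "b"], "Name": ["A", "B"]}
).set_index("ownerkey_norm")
ldf = pd.DataFrame(
    {"ownerkey_norm": ["a", "b"], "Name": ["ignored", "ignored"], "ContactId": [1, 2]}
).set_index("ownerkey_norm")

# DataFrame.join raises "columns overlap but no suffix specified" if both
# frames share non-key columns; drop the duplicates from the link frame first.
overlap = odf.columns.intersection(ldf.columns)
joined = odf.join(ldf.drop(columns=overlap))
```

Dropping from the link side keeps the owners frame authoritative for any shared columns.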
…ation
- Replaced `next(...endswith())` logic with explicit name matching and count validation for OwnerKey.
- Impact: prevents silent data corruption caused by non-deterministic column selection when multiple similar keys exist.
Logs the expected path on FileNotFoundError to make missing mappings visible and easier to diagnose.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…tact' into kas-bdms-537-resolve-missing-contact
transfers/contact_transfer.py
Outdated
```python
owner_key_col = next(
    col for col in odf.columns if col.lower().endswith("ownerkey")
)
link_owner_key_col = next(
    col for col in ldf.columns if col.lower().endswith("ownerkey")
)
```
next(...) without a default will raise StopIteration with a confusing traceback if no matching column exists. Consider providing a default (e.g., None) and raising a ValueError that clearly explains which expected column pattern was missing and what columns were found.
Suggested change:

```diff
-owner_key_col = next(
-    col for col in odf.columns if col.lower().endswith("ownerkey")
-)
-link_owner_key_col = next(
-    col for col in ldf.columns if col.lower().endswith("ownerkey")
-)
+owner_key_col = next(
+    (col for col in odf.columns if col.lower().endswith("ownerkey")),
+    None,
+)
+if owner_key_col is None:
+    raise ValueError(
+        "Expected a column whose name ends with 'ownerkey' "
+        "(case-insensitive) in OwnersData, but found columns: "
+        f"{list(odf.columns)}"
+    )
+link_owner_key_col = next(
+    (col for col in ldf.columns if col.lower().endswith("ownerkey")),
+    None,
+)
+if link_owner_key_col is None:
+    raise ValueError(
+        "Expected a column whose name ends with 'ownerkey' "
+        "(case-insensitive) in OwnerLink, but found columns: "
+        f"{list(ldf.columns)}"
+    )
```
```python
odf["ownerkey_norm"] = (
    odf["ownerkey_canonical"]
    .fillna("")
    .astype(str)
    .str.strip()
    .str.casefold()
    .replace({"": pd.NA})
)
ldf["ownerkey_norm"] = (
    ldf["ownerkey_canonical"]
    .fillna("")
    .astype(str)
    .str.strip()
    .str.casefold()
    .replace({"": pd.NA})
)
```
The helper columns ownerkey_canonical and ownerkey_norm are added to odf and can leak into subsequent processing/output. If they’re only intended for joining, consider dropping them after the join (or renaming to clearly-internal names like _ownerkey_norm) to keep the resulting dataframe schema stable.
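A sketch of the reviewer's suggestion, dropping the join-only helper columns after the merge so the output schema stays stable; the frame contents and `ContactId` column are placeholders:

```python
import pandas as pd

# Placeholder frames; the real ones come out of the OwnerKey canonicalization step.
odf = pd.DataFrame({
    "OwnerKey": ["Acme Corp"],
    "ownerkey_canonical": ["Acme Corp"],  # join-only helper
    "ownerkey_norm": ["acme corp"],       # join-only helper
})
ldf = pd.DataFrame({"ownerkey_norm": ["acme corp"], "ContactId": [7]})

merged = odf.merge(ldf, on="ownerkey_norm", how="left")
# Drop the helpers after the join so downstream consumers see a stable schema.
merged = merged.drop(columns=["ownerkey_canonical", "ownerkey_norm"])
```

Renaming to underscore-prefixed names like `_ownerkey_norm`, as the comment also suggests, is the lighter alternative when the columns must survive for debugging.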
Use vectorized replacement instead of per-row lambdas. Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Stops if multiple case-variant OwnerKey columns are present. Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
```python
with open(co_to_org_mapper_path, "r") as f:
    self._co_to_org_mapper = json.load(f)

ownerkey_mapper_path = get_transfers_data_path("owners_ownerkey_mapper.json")
try:
    with open(ownerkey_mapper_path, "r") as f:
        self._ownerkey_mapper = json.load(f)
except FileNotFoundError:
    logger.warning(
        "Owner key mapper file not found at '%s'; proceeding with empty owner key mapping.",
        ownerkey_mapper_path,
    )
    self._ownerkey_mapper = {}
```
JSON files should be opened with an explicit encoding to avoid platform-dependent defaults. Use `encoding="utf-8"` for both mapper reads.
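A small sketch of the suggested fix, wrapping the mapper read with an explicit encoding; the function name `load_mapper` is an invented stand-in for the inline `__init__` logic:

```python
import json

def load_mapper(path: str) -> dict:
    # Explicit encoding avoids locale-dependent defaults (e.g. cp1252 on Windows),
    # which would garble non-ASCII owner names in the mapping file.
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        # Mirrors the PR's fallback: proceed with an empty mapping.
        return {}
```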
```python
odf["ownerkey_norm"] = (
    odf["ownerkey_canonical"]
    .fillna("")
    .astype(str)
    .str.strip()
    .str.casefold()
    .replace({"": pd.NA})
)
ldf["ownerkey_norm"] = (
    ldf["ownerkey_canonical"]
    .fillna("")
    .astype(str)
    .str.strip()
    .str.casefold()
    .replace({"": pd.NA})
)
```
The normalization logic is duplicated for odf and ldf, which increases the chance of future drift (e.g., one side adding a transform but not the other). Consider extracting this into a small helper (e.g., _normalize_ownerkey(series)) and using it for both frames.
Why
This PR addresses the following problem / context:
How
Implementation summary - the following was changed / added / removed:
Notes
Any special considerations, workarounds, or follow-up work to note?
- `transfers/data/owners_ownerkey_mapper.json` — update when new OwnerKey collisions are discovered.
- Related issue: “OwnerKey case doesn’t match OwnerLink” #479.
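For reference, the mapper file described above is presumably a flat variant-to-canonical JSON object, matching how the loader feeds it into `Series.map`; the entries below are invented placeholders, not real data:

```json
{
  "ACME CORP": "Acme Corp",
  "acme corporation": "Acme Corp"
}
```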