fix(pairs): discriminate OM/IEM forecasts by source, not issued_at (#67)#69
Merged
Merged
Conversation
Phase 20+ Open-Meteo rows carry a derived issued_at, so the old issued_at-presence split in build_pairs_row() misrouted them into the IEM MOS aggregation path — silently nulling forecast temps and polluting IEM run-selection when both sources are combined. Discriminate by the authoritative source field instead (open_meteo* -> Open-Meteo, else IEM), matching the contract research._fetch_open_meteo_range already documents. Also teach _aggregate_fcst_temps_openmeteo to fall back to a pre-converted temperature_f so source-discriminated rows from the research() wrapper (which emits temperature_f, not temperature_c) aggregate correctly instead of returning null. TDD: 3 new regression tests in test_pairs.py cover the derived-issued_at classification, IEM-still-preferred, and the research-wrapper temperature_f shape.
Codex review caught a regression: research._fetch_open_meteo_range emits OM rows carrying pop_6hr_pct / qpf_6hr_in (not precipitation_probability_pct and no QPF read). Pre-#67 those rows flowed through the IEM branch, which set both fields; after source-routing they hit the OM branch, which only read precipitation_probability_pct and never set QPF — silently nulling precip columns for research(forecast_source="open_meteo"). OM branch now accepts the pop_6hr_pct alias (explicit None-checks keep a valid 0.0) and sums qpf_6hr_in over the window, matching IEM semantics. +2 regression tests.
…odex P2) Codex flagged that the pure source-prefix split regressed the previously documented Open-Meteo shape (no source, no issued_at, temperature_c) to the IEM branch -> null temps. Extract _is_open_meteo_record(): source prefixed open_meteo OR (no source AND no issued_at). Real IEM rows always carry an issued_at, so a record missing both can only be legacy OM. +1 regression test.
… codex P2) Codex flagged leakage-safety provenance loss: Phase 20+ OM rows carry a derived issued_at; pre-#67 the IEM branch set fcst_issued via _select_best_run, but after source routing the OM branch left fcst_issued_at None for forecast_source="open_meteo". Restore it as the most-recent OM issued_at at-or-before market close — never leaking a run issued after settlement. +2 regression tests.
…P1 leakage) Codex P1: my provenance fix filtered issued_at only for the displayed timestamp; the OM aggregation still summed window rows issued AFTER market close, leaking their temp/POP/QPF into the training pair (lookahead). Pre-#67 _select_best_run applied this cutoff; the OM branch did not. Now filter window_om by issued_at <= market_close before aggregating temps/POP/QPF/ issued_at. Legacy source-less rows (no issued_at) are kept — nothing to leak. Strengthened test asserts the after-close temp is excluded, not just hidden.
|
Parity ticket gate: PASSED See |
helloiamvu
added a commit
that referenced
this pull request
Jun 6, 2026
…h() docs (#52) Bump all three packages 1.5.2 -> 1.5.3. - #67 (#69): build_pairs_row() now discriminates OM/IEM forecasts by source instead of issued_at presence, fixing silent null temps + IEM run-selection pollution for Phase 20+ multi-source callers; closes a lookahead-leakage path for after-close OM runs. Parity gate unaffected. - #52 (#70): research() docstring clarifies daily-row return granularity.
This was referenced Jun 6, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #67.
build_pairs_row()split forecast records byissued_atpresence, but Phase 20+ Open-Meteo rows carry a derivedissued_at, so they were misrouted into the IEM MOS aggregation path — silently nulling forecast temps and polluting IEM run-selection when both sources are combined.Discriminate by the authoritative
sourcefield instead (open_meteo*→ Open-Meteo, else IEM), matching the contractresearch._fetch_open_meteo_rangealready documents ("discriminates via row.get('source')").Changes
_is_open_meteo_record():sourceprefixedopen_meteoOR legacy no-source/no-issued_atshape (real IEM always carriesissued_at, so a record missing both can only be legacy OM — keeps backward-compat).temperature_f(the shape the research wrapper emits) so source-routed rows don't aggregate to null.pop_6hr_pct/qpf_6hr_inandfcst_issued_atprovenance that previously came via the IEM branch._select_best_run's cutoff. The SDK's temporal-safety invariant.Tests (TDD)
8 new regression tests in
test_pairs.pycovering: derived-issued_atclassification, IEM-still-preferred, research-wrappertemperature_fshape, pop/qpf survival, zero-POP, legacy source-less shape, issued_at provenance, and after-close leakage exclusion. Parity gate (tests/test_parity.py) unchanged.Review
Codex (gpt-5.5, medium) — 4 iterations: 1 P1 (leakage) + 3 P2 (precip regression, legacy shape, provenance) all fixed; final review clean.
TS Parity
Python-internal forecast-join discriminator fix; no public API surface change (same
research()columns). The TS twin'sbuildPairsRowequivalent should get a parity ticket per CROSS-SDK-SYNC.md to apply the same source-based split — tracked separately.