
bug: build_pairs_row() misclassifies OM records as IEM when derived issued_at is present #67

@zax0rz

Problem

build_pairs_row() in packages/core/src/mostlyright/_internal/_pairs.py incorrectly classifies Open-Meteo (OM) forecast records as IEM MOS records when the OM records carry a derived issued_at field.

The current code splits the forecast list using issued_at presence as the sole discriminator:

# Line ~264-265
iem_records = [r for r in forecasts if r.get("issued_at")]
om_records = [r for r in forecasts if not r.get("issued_at")]

This logic was correct when Open-Meteo records truly had no issued_at, but Phase 20+ OM records now carry a derived issued_at (e.g. to support cycle-math in the Open-Meteo fetcher). As a result, any OM record with a populated issued_at is routed into iem_records, passed to _select_best_run() and _aggregate_fcst_temps_iem(), and treated as IEM MOS data.

Reproduction

  1. Fetch forecasts for a station where both IEM MOS and Open-Meteo data are available (e.g. forecast_sources=["iem_mos", "open_meteo"]).
  2. Ensure the Open-Meteo records include a derived issued_at (this is the default behavior in Phase 20+ for training mode / previous-runs caching).
  3. Call build_pairs_row() with the combined forecast list.
  4. Observe that OM records with issued_at are missing from om_records and incorrectly appear in iem_records.

Minimal conceptual trigger:

forecasts = [
    {"source": "iem.archive", "model": "NBS", "issued_at": "2026-06-04T12:00:00Z", "valid_at": "...", "temperature_f": 72},
    {"source": "open_meteo.previous_runs", "model": "ecmwf_ifs04", "issued_at": "2026-06-04T06:00:00Z", "valid_at": "...", "temperature_c": 22},
]
# With the current split: iem_records contains both rows; om_records == []
# The OM row is then passed to _aggregate_fcst_temps_iem(), which reads temperature_f and finds None
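
The trigger above can be checked in isolation. The following is a minimal sketch, not the code from _pairs.py: it reproduces only the two list comprehensions quoted in the Problem section. The sample rows are illustrative and omit valid_at, since the split never reads it, and the `misrouted` name is introduced here for the demonstration.

```python
# Sketch of the issued_at-based split, applied to two illustrative rows.
# valid_at is left out on purpose: the split only inspects issued_at.
forecasts = [
    {"source": "iem.archive", "model": "NBS",
     "issued_at": "2026-06-04T12:00:00Z", "temperature_f": 72},
    {"source": "open_meteo.previous_runs", "model": "ecmwf_ifs04",
     "issued_at": "2026-06-04T06:00:00Z", "temperature_c": 22},
]

# The discriminator as quoted in the issue.
iem_records = [r for r in forecasts if r.get("issued_at")]
om_records = [r for r in forecasts if not r.get("issued_at")]

# Any Open-Meteo row that landed in the IEM bucket was misrouted.
misrouted = [r["source"] for r in iem_records if r["source"].startswith("open_meteo")]

print(len(iem_records), len(om_records), misrouted)
# prints: 2 0 ['open_meteo.previous_runs']
```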

Root Cause

The discriminator assumes:

  • issued_at present → IEM MOS
  • issued_at absent → Open-Meteo

This assumption is violated by Phase 20+ Open-Meteo records, which have source values such as open_meteo.previous_runs, open_meteo.single_run, open_meteo.live, or open_meteo.seamless, and carry a derived issued_at.

The correct discriminator is the source field, which is authoritative per the schema definitions in packages/core/src/mostlyright/core/schemas/forecast.py and packages/core/src/mostlyright/_internal/specs/forecast_series.json.

Suggested Fix

Replace the issued_at-based split with a source-based split:

# Before
iem_records = [r for r in forecasts if r.get("issued_at")]
om_records = [r for r in forecasts if not r.get("issued_at")]

# After
# `or ""` also covers records where "source" is present but None,
# which r.get("source", "") would not.
iem_records = [r for r in forecasts if not (r.get("source") or "").startswith("open_meteo")]
om_records = [r for r in forecasts if (r.get("source") or "").startswith("open_meteo")]

This aligns with the schema contract where IEM records have source="iem.archive" and OM records have source values prefixed with open_meteo.
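
A source-based split could also be factored into a small helper so it can be covered by a regression test. This is only a sketch under stated assumptions: `split_by_source` and `OM_PREFIX` are hypothetical names, not existing functions in the package. Like the suggested fix, it treats everything that is not prefixed with open_meteo as IEM, so a record with a missing or null source falls into the IEM bucket.

```python
from typing import Any

# Hypothetical constant; the prefix itself comes from the source values listed in this issue.
OM_PREFIX = "open_meteo"


def split_by_source(
    forecasts: list[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Return (iem_records, om_records), using `source` as the discriminator."""
    iem_records: list[dict[str, Any]] = []
    om_records: list[dict[str, Any]] = []
    for record in forecasts:
        # `or ""` guards against a missing key and an explicit None alike.
        source = record.get("source") or ""
        if source.startswith(OM_PREFIX):
            om_records.append(record)
        else:
            iem_records.append(record)
    return iem_records, om_records


# Regression case: an OM row that carries a derived issued_at.
sample = [
    {"source": "iem.archive", "issued_at": "2026-06-04T12:00:00Z", "temperature_f": 72},
    {"source": "open_meteo.previous_runs", "issued_at": "2026-06-04T06:00:00Z", "temperature_c": 22},
]
iem, om = split_by_source(sample)
print([r["source"] for r in iem], [r["source"] for r in om])
# prints: ['iem.archive'] ['open_meteo.previous_runs']
```

A single pass keeps the two buckets mutually exclusive by construction, which the pair of list comprehensions only guarantees as long as both conditions stay in sync.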

Impact

  • Data misclassification: Open-Meteo forecasts with a derived issued_at are processed through the IEM MOS aggregation paths. Those paths look up temperature_f, but OM records store temperature_c, so the lookup finds nothing and the forecast temperature silently becomes null.
  • Training data quality: If IEM MOS is the preferred source, the fallback to OM is skipped entirely when OM records are misclassified as IEM, potentially yielding null forecasts for dates where valid OM data exists.
  • Affects Phase 20+ workflows that combine or cache both forecast sources.
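
The silent-null effect described in the first bullet can be illustrated with a stand-in aggregator. This is an assumption-laden sketch: `max_temperature_f` is hypothetical and is not the real _aggregate_fcst_temps_iem(); it only mimics an aggregation that reads temperature_f and ignores rows where that field is absent.

```python
from typing import Any, Optional


def max_temperature_f(records: list[dict[str, Any]]) -> Optional[float]:
    """Hypothetical IEM-style aggregation: considers temperature_f only."""
    values = [r["temperature_f"] for r in records if r.get("temperature_f") is not None]
    # No usable values means a null result, with no error raised.
    return max(values) if values else None


iem_row = {"source": "iem.archive", "temperature_f": 72}
om_row = {"source": "open_meteo.previous_runs", "temperature_c": 22}

print(max_temperature_f([iem_row]))  # prints: 72
print(max_temperature_f([om_row]))   # prints: None
```

Nothing fails loudly in the second call, which is why the misclassification only shows up later as missing forecast temperatures.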

Files Affected

  • packages/core/src/mostlyright/_internal/_pairs.py: build_pairs_row() function

Severity

Medium-High: silently corrupts forecast data for multi-source callers in Phase 20+.
