51 commits
dc74515
feat(25-02): unblock Wave 1 deps — [satellite] extra + satellite exce…
minereda Jun 18, 2026
dad0574
test(25-02): synthetic in-memory xr.Dataset fixtures (one per registr…
minereda Jun 18, 2026
4c31e6d
test(25-02): RED — PRODUCTS registry + ABI/lat-lon projection + parse…
minereda Jun 18, 2026
6c4839d
feat(25-02): port PRODUCTS registry + ABI/lat-lon projection + parse_…
minereda Jun 18, 2026
9294384
test(25-02): value-decode + record-build quirks, ICAO build, units-su…
minereda Jun 18, 2026
0090417
test(25-03): RED — S3/GCS whole-file transport + mirror switch + size…
minereda Jun 18, 2026
a0db947
feat(25-03): GOES S3/GCS whole-file transport — mirror switch + singl…
minereda Jun 18, 2026
15dabea
test(25-03): RED — satellite dedup (mirror-invariant) + validate disp…
minereda Jun 18, 2026
c2f002b
feat(25-03): satellite merge policy — dedup (mirror-invariant) + vali…
minereda Jun 18, 2026
26adc9d
test(25-03): RED — satellite cache tier (path hardening + mirror-inva…
minereda Jun 18, 2026
6ef8b71
feat(25-03): satellite cache tier — path hardening + direct atomic wr…
minereda Jun 18, 2026
fdc0b11
test(25-04): RED — public satellite() fetcher + mirror enum + qc redu…
minereda Jun 18, 2026
713b561
feat(25-04): public satellite() fetcher — mirror enum + lazy-guard + …
minereda Jun 18, 2026
badcd01
test(25-04): leakage wiring — event/knowledge time + typed as_of via …
minereda Jun 18, 2026
c5515c2
test(25-04): harden satellite() coverage — getattr guard, empty-resol…
minereda Jun 18, 2026
3b591f4
test(25-05): RED — backfill orchestrator (slices, direct atomic write…
minereda Jun 18, 2026
370f81a
feat(25-05): backfill orchestrator — slices, direct atomic write, Thr…
minereda Jun 18, 2026
f76805f
test(25-05): RED — resume hardening + single-writer lock + argparse C…
minereda Jun 18, 2026
c202dde
feat(25-05): resume layer + single-writer lock + argparse CLI (--mirr…
minereda Jun 18, 2026
f4c187c
test(25-05): RED — empirical rate-limit/throughput probe (D10 SAT-25-11)
minereda Jun 18, 2026
4fd459e
feat(25-05): empirical rate-limit/throughput probe (D10 SAT-25-11) — …
minereda Jun 18, 2026
ef25872
test(25-05): RED — docs/satellite.md + README satellite section asser…
minereda Jun 18, 2026
7020127
feat(25-05): docs/satellite.md + README satellite section
minereda Jun 18, 2026
2150623
test(25): RED — live satellite() path dedups reprocessed scans (P2-a)
minereda Jun 18, 2026
cdf3119
fix(25): dedup live satellite() rows first-seen-wins (P2-a)
minereda Jun 18, 2026
93df89b
test(25): RED — register SatelliteSchema + source-identity + codegen …
minereda Jun 18, 2026
2fe093b
fix(25): land + register SatelliteSchema, wire codegen (P2-b)
minereda Jun 18, 2026
61a0174
test(25): RED — satellite() runs schema source-identity validation (P…
minereda Jun 18, 2026
5e182bd
fix(25): satellite() runs schema.satellite.v1 source-identity validat…
minereda Jun 18, 2026
bb818e0
test(25): cover _probe.py idempotent SOURCE-LIMITS rewrite + derive e…
minereda Jun 18, 2026
51345bb
test(25): cover _goes_s3 network error paths — fail-fast/retry/exhaus…
minereda Jun 18, 2026
4d69614
style(25): ruff format satellite/__init__.py validation helper
minereda Jun 18, 2026
158065e
fix(25): serialize write_satellite_cache read-modify-write under one …
minereda Jun 18, 2026
64948cf
test(25): cover per-attr GoesDataCorruptError branches in projection …
minereda Jun 18, 2026
490b76b
ci(25): add measurable >=80% line-coverage lane for 4 satellite modules
minereda Jun 18, 2026
cbbce72
test(satellite): RED for 4 Phase 25 GOES backfill findings
minereda Jun 18, 2026
68867c6
fix(satellite): picklable process-pool worker + full-identity resume key
minereda Jun 18, 2026
0a3e1e9
fix(satellite): honor backfill --out for the parquet partition write
minereda Jun 18, 2026
7074b1c
fix(satellite): accept 3D profile shapes in the pre-pixel-read shape …
minereda Jun 18, 2026
2404ea4
style(satellite): ruff format + sync resume-key docstring to full ide…
minereda Jun 18, 2026
7ac21d3
test(satellite): RED — 3D profile gate must accept TRAILING pressure …
minereda Jun 18, 2026
919aa5e
fix(satellite): 3D profile gate validates LEADING spatial dims, TRAIL…
minereda Jun 18, 2026
c795318
test(satellite): RED — drop scans outside the requested event-time wi…
minereda Jun 18, 2026
6114467
fix(satellite): filter emitted scans to the requested event-time wind…
minereda Jun 18, 2026
318bc33
test(satellite): RED — backfill must not mark current/future months c…
minereda Jun 18, 2026
efa75f4
fix(satellite): only mark fully-elapsed/written backfill slices compl…
minereda Jun 18, 2026
0b4fff6
test(satellite): RED — frame must stamp df.attrs['retrieved_at'] (P2-4)
minereda Jun 18, 2026
2f1608c
fix(satellite): stamp df.attrs['retrieved_at'] with the real fetch ti…
minereda Jun 18, 2026
52a2fe3
style(satellite): ruff format + en-dash fix on P2-1/P2-2 lines
minereda Jun 18, 2026
958b580
test(satellite): guard extra-dependent tests so the no-extra CI fast-…
minereda Jun 18, 2026
161504d
style(satellite): ruff format the test-guard import lines
minereda Jun 18, 2026
36 changes: 36 additions & 0 deletions .coveragerc-satellite
@@ -0,0 +1,36 @@
# Phase 25 satellite coverage lane (P2 verification-completeness fix).
#
# The default [tool.coverage.run] in pyproject.toml sets ``branch = true``,
# which forces coverage's C tracer. Any subprocess that imports numpy/pandas
# under the C tracer raises ``numpy: cannot load module more than once per
# process``, so the four heaviest new satellite modules — _goes_extract,
# _goes_s3, _internal/merge/satellite, core/schemas/satellite — could not be
# coverage-measured at all (the >=80% gate was inferred, not proven).
#
# This config measures LINE coverage only (``branch = false``), which lets
# coverage use the sys.monitoring (sysmon) backend instead of the C tracer.
# sysmon does not trigger the numpy single-load reload error, so the four
# modules become measurable. Line coverage is exactly what the CLAUDE.md
# ">=80% coverage on new code" gate requires.
#
# Drive it with ``COVERAGE_CORE=sysmon`` (see the ``satellite-coverage`` CI
# lane in .github/workflows/test.yml). Run it as its OWN pytest process so the
# satellite test modules import numpy/pandas exactly once.
[run]
branch = false
# Path-based include (not dotted module ``source``): the four target modules
# are imported lazily inside test functions, and under the sysmon backend the
# dotted-name ``source`` filter does not reliably attach to a module that was
# imported after coverage start. Filesystem ``include`` globs match on the
# resolved file path instead, which is stable regardless of import timing.
include =
    */mostlyright/weather/_fetchers/_goes_extract.py
    */mostlyright/weather/_fetchers/_goes_s3.py
    */mostlyright/_internal/merge/satellite.py
    */mostlyright/core/schemas/satellite.py

[report]
show_missing = true
skip_covered = false
# 80% line floor on the four otherwise-unmeasurable satellite modules.
fail_under = 80
37 changes: 37 additions & 0 deletions .github/workflows/test.yml
@@ -236,3 +236,40 @@ jobs:
--cov-fail-under=85 \
--cov-report=term-missing:skip-covered \
-q

# Phase 25 (P2 verification-completeness fix): the default coverage-gate job
# above runs ``--cov-branch``, which forces coverage's C tracer. Any
# subprocess importing numpy/pandas under the C tracer raises
# ``numpy: cannot load module more than once per process``, so the four
# heaviest new satellite modules (_goes_extract, _goes_s3,
# _internal/merge/satellite, core/schemas/satellite) could not be
# coverage-measured — the >=80% gate on them was inferred, not proven. This
# dedicated lane measures LINE coverage under the sys.monitoring (sysmon)
# backend (branch=false in .coveragerc-satellite), which does NOT trip the
# numpy reload, and fails the build if any of the four drops below 80%.
  satellite-coverage:
    needs: changes
    runs-on: ubuntu-latest
    steps:
      - name: No-op (no Python-relevant changes)
        if: needs.changes.outputs.py != 'true'
        run: echo "No Python-relevant changes in this PR; satellite-coverage is a no-op success."

      - uses: actions/checkout@v4
        if: needs.changes.outputs.py == 'true'

      - name: Install uv
        if: needs.changes.outputs.py == 'true'
        uses: astral-sh/setup-uv@v3

      - name: Set up Python
        if: needs.changes.outputs.py == 'true'
        run: uv python install 3.12

      - name: Sync workspace + satellite extra
        if: needs.changes.outputs.py == 'true'
        run: uv sync --all-packages --extra satellite

      - name: Satellite coverage (4 modules, >= 80% line via sysmon)
        if: needs.changes.outputs.py == 'true'
        run: bash scripts/satellite_coverage.sh
27 changes: 27 additions & 0 deletions README.md
@@ -127,6 +127,33 @@ const result = validateRows(rows, "schema.observation.v1");
// — ready to pass through to an agent's tool-call response.
```

## GOES satellite (Phase 25)

GOES-16/19 ABI L2 single-pixel extraction from NOAA's anonymous public NODD
buckets — a leakage-safe **feature supplement** (cloud-mask / land-surface
covariates), not a primary Tmax/Tmin settlement source. Ships as the optional
`mostlyrightmd-weather[satellite]` extra (whole-file S3/GCS reads via
`s3fs`/`gcsfs` + `h5netcdf`; no hosted backend — reads the same anonymous public
buckets as the AWC/IEM/NWP calls).

```bash
pip install "mostlyrightmd-weather[satellite]"
```

```python
from mostlyright.weather.satellite import satellite
df = satellite("KNYC", "goes16", product="ABI-L2-ACMC", start=..., end=...)
```

The fleet bulk/training path is `python -m mostlyright.weather.satellite
backfill` (per-`(satellite,year,month)` slices, crash-safe resume,
`--mirror aws|gcp`, Thread/Process split). `max_workers` + the S3 rate cap are
probe-derived constants — run `python -m mostlyright.weather.satellite probe` to
re-measure. See [docs/satellite.md](docs/satellite.md) for cheap-CONUS steering,
DSRF gating, the 28 TB / near-data reality, and the deferred-paid-adapter note
(the future paid adapter shares the `noaa_goes` source identity — byte-identical
— distinguished only by the informational `delivery` lineage column).

## Why mostlyright

- **No hosted backend.** Direct calls to public APIs (NOAA, NWS, IEM, Kalshi, Polymarket). No proxy. No vendor account. No rate-limited tier.
176 changes: 176 additions & 0 deletions docs/satellite.md
@@ -0,0 +1,176 @@
# GOES Satellite (Phase 25)

GOES-16/19 ABI Level-2 **single-pixel** extraction from NOAA's anonymous public
NODD buckets — a leakage-safe **feature supplement** for prediction-market
weather research. Ships as the optional `mostlyrightmd-weather[satellite]` extra
(mirrors the `[nwp]` extra and the `forecast_nwp()` pipeline shape).

> **This is a feature supplement, not a primary signal.** For daily Tmax/Tmin
> settlement (Kalshi NHIGH/NLOW) the NWP forecasts (`forecast_nwp()`) and CLI
> settlements are the load-bearing inputs. Satellite cloud-mask / land-surface
> features are MARGINAL for raw temperature highs/lows — useful as a model
> covariate (cloud cover, clear-sky flags), not as the settlement source.

## Install

```bash
pip install "mostlyrightmd-weather[satellite]"
```

The extra brings `boto3` (anonymous UNSIGNED listing), `s3fs` + `gcsfs`
(whole-file reads), `h5netcdf` (HDF5 decode via wheel — no system `libhdf5`),
`xarray`, `numpy`, `pandas`. The module imports cleanly without the extra; the
heavy deps are lazy-imported inside `satellite()` and a missing extra raises a
`SourceUnavailableError` with the install hint.
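
The lazy-import guard described above might look roughly like the following. This is a minimal sketch, not the shipped code: `SourceUnavailableError` is redefined locally as a stand-in for the SDK's error type, and `require_extra` and the hint text are hypothetical names chosen for illustration.

```python
import importlib
from types import ModuleType


class SourceUnavailableError(RuntimeError):
    """Local stand-in for the SDK's error type (assumption for this sketch)."""


# Hypothetical hint text; the real message may differ.
_INSTALL_HINT = 'pip install "mostlyrightmd-weather[satellite]"'


def require_extra(module_name: str) -> ModuleType:
    """Import a heavy dependency on first use, or fail with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise SourceUnavailableError(
            f"{module_name!r} is not installed; try: {_INSTALL_HINT}"
        ) from exc
```

Calling a helper like this inside the fetcher, rather than at module top level, is what would let the package import cleanly when the extra is absent.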

## Quick Start

```python
from datetime import datetime, UTC
from mostlyright.weather.satellite import satellite

# Cheap CONUS cloud-mask for one station, one day.
df = satellite(
    "KNYC",
    "goes16",
    product="ABI-L2-ACMC",  # primary cheap CONUS product
    start=datetime(2024, 6, 15, tzinfo=UTC),
    end=datetime(2024, 6, 15, 23, 59, tzinfo=UTC),
)
# One row per (station, variable, scan_start); leakage-safe overlay columns
# (source, event_time, knowledge_time, retrieved_at) + qc_status.
```

### `--mirror aws|gcp` (transport-only)

Both `satellite(..., mirror=...)` and the backfill CLI `--mirror` accept
`"aws"` (default) or `"gcp"`. The fleet backfill should run **in-region** on
whichever cloud you use — AWS `us-east-1` or GCS `us-central1` — so the 28 TB of
transient download bandwidth is free (NODD egress).

`mirror` is a **transport choice only**: the same NOAA GOES product lands in the
same cache partition (`~/.mostlyright/cache/v1/satellite/{satellite}/{product}/{station}/{YYYY}/{MM}.parquet`)
whether fetched from AWS (`noaa-goes16` / `noaa-goes19`) or GCS
(`gcp-public-data-goes-16` / `gcp-public-data-goes-19`). It does **not** change
`df.attrs["source"]` (`"noaa_goes"` for both mirrors) and is **not** a schema
column.
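
To make the "transport choice only" point concrete, here is a small sketch under stated assumptions. The bucket names and the partition layout are taken from the text above; the helper names `bucket_for` and `cache_partition` are hypothetical and are not the library's API.

```python
from pathlib import Path

# Bucket names as listed in the text above.
_BUCKETS = {
    ("aws", "goes16"): "noaa-goes16",
    ("aws", "goes19"): "noaa-goes19",
    ("gcp", "goes16"): "gcp-public-data-goes-16",
    ("gcp", "goes19"): "gcp-public-data-goes-19",
}


def bucket_for(mirror: str, satellite: str) -> str:
    """Resolve the transport bucket for a mirror (hypothetical helper)."""
    try:
        return _BUCKETS[(mirror, satellite)]
    except KeyError:
        raise ValueError(
            f"unsupported mirror/satellite: {mirror!r}/{satellite!r}"
        ) from None


def cache_partition(
    root: Path, satellite: str, product: str, station: str, year: int, month: int
) -> Path:
    """Build the cache partition path. Note that `mirror` is not an input."""
    return (
        root / "satellite" / satellite / product / station
        / f"{year:04d}" / f"{month:02d}.parquet"
    )
```

Because `cache_partition` never sees the mirror, two fetches of the same scan from different clouds would resolve to the same file, which is the invariance the section describes.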

## Products

The extractor carries the full registry, but the public default and these docs
steer to **cheap CONUS** products. DSRF (full-disk) is gated — see below.

| Product | Scale | Notes |
|---|---|---|
| **ABI-L2-ACMC** | CONUS ~0.3–1.5 MB/file | **Primary.** Clear-Sky / Cloud Mask. The cheap default. |
| ABI-L2-LSTC | CONUS ~0.3–1.5 MB/file | Land Surface Temperature. |
| ABI-L2-DSIC / TPWC | CONUS ~0.3–1.5 MB/file | Derived stability / total precipitable water. |
| ABI-L2-DSRF | full-disk ~50 MB/file | **GATED.** Downward Shortwave Radiation, full-disk (~25 of the 28 TB v1 corpus). |

### DSRF gating

The live `satellite(..., product="ABI-L2-DSRF")` path emits a one-time warning:
DSRF is full-disk (~50 MB/file) and dominates the v1 corpus. The live fetcher
fetches per-scan and will **never silently start a multi-TB download** — for bulk
DSRF pulls use the backfill CLI and run it **in-region** (near-data compute).

## QC: annotate-never-drop

Every row carries `qc_status ∈ {clean, flagged, suspect}` — no row is dropped, no
quarantine file. The severity is deliberately inverted: a physics-violating pixel
is almost always an *extraction* bug, so an error-class finding maps to
`suspect` (kept for inspection) and a warning-class finding maps to `flagged`. A
`pixel_value=None` on a NetCDF `_FillValue` is a clean data condition, not an
error.
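
The inverted severity mapping can be sketched as below. The severity labels `"error"` and `"warning"` and the helper names are assumptions made for illustration; only the three `qc_status` values come from the text.

```python
def qc_status(severities: list[str]) -> str:
    """Map validator finding severities to a row-level qc_status.

    error-class -> "suspect", warning-class -> "flagged", none -> "clean".
    """
    if "error" in severities:
        return "suspect"
    if "warning" in severities:
        return "flagged"
    return "clean"


def annotate(row: dict, severities: list[str]) -> dict:
    """Return a copy of the row with qc_status attached; the row is never dropped."""
    return {**row, "qc_status": qc_status(severities)}
```

A fill-value pixel with no findings stays `clean` under this mapping, for example `annotate({"pixel_value": None}, [])`.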

## Leakage safety

`scan_start_utc` is event-time (parsed from the NetCDF filename, stdlib only);
`as_of_time` / `knowledge_time` is knowledge-time, stamped at fetch (or the
backfill `ingested_at`). Both flow through the SDK's `KnowledgeView` /
`assert_no_leakage`, so a satellite feature backtests the same way it trades —
pass `as_of=<TimePoint|datetime>` to filter on typed datetimes (never a lexical
string snapshot).
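
As a rough illustration of the two timestamps, the sketch below parses the scan start from a GOES-R style filename using only the standard library, then filters rows on a typed `as_of`. It assumes the usual `s<YYYY><DDD><HH><MM><SS><tenths>` start field; the function names and the `knowledge_time` dict key are hypothetical, and the filename in the usage note is illustrative rather than a real object.

```python
import re
from datetime import datetime, timedelta, timezone

# Scan start field: year, day-of-year, hour, minute, second, tenths of a second.
_SCAN_START = re.compile(r"_s(\d{4})(\d{3})(\d{2})(\d{2})(\d{2})(\d)_")


def parse_scan_start(filename: str) -> datetime:
    """Parse the event-time (scan start, UTC) from a GOES NetCDF filename."""
    m = _SCAN_START.search(filename)
    if m is None:
        raise ValueError(f"no scan-start field in {filename!r}")
    year, doy, hh, mm, ss, tenths = (int(g) for g in m.groups())
    if not 1 <= doy <= 366:
        raise ValueError(f"day-of-year out of range in {filename!r}")
    base = datetime(year, 1, 1, tzinfo=timezone.utc) + timedelta(days=doy - 1)
    return base.replace(hour=hh, minute=mm, second=ss, microsecond=tenths * 100_000)


def visible_as_of(rows: list[dict], as_of: datetime) -> list[dict]:
    """Keep rows whose knowledge_time is at or before as_of (datetime compare)."""
    return [r for r in rows if r["knowledge_time"] <= as_of]
```

For a name such as `OR_ABI-L2-ACMC-M6_G16_s20241670001173_e20241670003546_c20241670004312.nc`, day 167 of 2024 resolves to 15 June, so the parsed scan start is 2024-06-15 00:01:17.3 UTC.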

## Cache

`~/.mostlyright/cache/v1/satellite/{satellite}/{product}/{station}/{YYYY}/{MM}.parquet`,
filelock-guarded, atomic write, deduped first-seen-wins on
`(station, satellite, product, variable, pressure_level, scan_start)`. The cache
partition is mirror-invariant (D9).
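
First-seen-wins dedup on that key can be sketched in plain Python as follows. The key columns are the ones listed above; `dedup_first_seen` is a hypothetical helper operating on dict rows, not the merge policy's actual code.

```python
_KEY = ("station", "satellite", "product", "variable", "pressure_level", "scan_start")


def dedup_first_seen(rows: list[dict]) -> list[dict]:
    """Drop later rows that repeat an identity key; the first occurrence wins."""
    seen: set[tuple] = set()
    kept: list[dict] = []
    for row in rows:
        key = tuple(row.get(col) for col in _KEY)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept
```

Since the mirror is not part of the key, a scan fetched once from each cloud collapses to a single row in this sketch.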

## Bulk backfill + the 28 TB reality

The fleet backfill is the bulk/training path:

```bash
python -m mostlyright.weather.satellite backfill \
  --satellites goes16,goes19 \
  --products ABI-L2-ACMC \
  --stations KNYC \
  --year-start 2024 --year-end 2024 \
  --out ~/.mostlyright/cache/ \
  --max-workers 8 \
  --executor thread \
  --mirror aws \
  --resume
```

Per-`(satellite, year, month)` array-job-friendly slices; crash-safe resume
(malformed-key rejection + fsync durability + `.bak` fallback + a single-writer
lockfile); `--executor thread` for small CONUS files, `--executor process` for
DSRF full-disk decode (CPU-bound work that threads would serialize on the GIL
and the HDF5 global mutex). Slices write **directly** to the per-partition cache (no staging, no
intermediate object store).
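
The slice model can be pictured with the sketch below. It assumes slices are simply the product of satellites, years, and calendar months, and that a slice should only be marked complete once its month has fully elapsed (as the commit list for this PR suggests); both function names are hypothetical.

```python
from datetime import datetime, timezone
from itertools import product


def backfill_slices(satellites: list[str], year_start: int, year_end: int) -> list[tuple]:
    """Enumerate (satellite, year, month) slices, one per array-job task."""
    years = range(year_start, year_end + 1)
    return list(product(satellites, years, range(1, 13)))


def is_fully_elapsed(year: int, month: int, now: datetime) -> bool:
    """True only once the whole calendar month is in the past (UTC)."""
    if month == 12:
        next_year, next_month = year + 1, 1
    else:
        next_year, next_month = year, month + 1
    return now >= datetime(next_year, next_month, 1, tzinfo=timezone.utc)
```

With `--satellites goes16,goes19 --year-start 2024 --year-end 2024` this enumeration yields 24 slices, each of which could be handed to a separate array-job index.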

**Why whole-file reads, not byte-range / lazy.** The transport reads the ENTIRE
object in one shot — NOT a byte-range / lazy `fs.open` handed to xarray. Single-
pixel byte-range was measured ~4× slower than a full download on a 37 MB DSRF
file (the HDF5 metadata b-tree walk dominates), and the lazy per-range path on
GCS triggers a per-range SSL re-handshake that serializes the pool. So the read
primitive is a single full-object `cat_file` into an in-memory buffer.
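
A minimal sketch of that read primitive follows. It relies only on the `cat_file(path)` method that fsspec filesystems such as `s3fs` and `gcsfs` provide; `read_whole_object` and the `SupportsCatFile` protocol are hypothetical names, and the commented usage is an assumption about how the buffer might be handed to xarray.

```python
import io
from typing import Protocol


class SupportsCatFile(Protocol):
    """Anything exposing fsspec's cat_file(path) -> bytes."""

    def cat_file(self, path: str) -> bytes: ...


def read_whole_object(fs: SupportsCatFile, path: str) -> io.BytesIO:
    """Fetch the entire object in one call and wrap it in a seekable buffer."""
    return io.BytesIO(fs.cat_file(path))


# Usage sketch (needs the [satellite] extra and network access;
# the object key is a placeholder, not a real path):
#   import s3fs
#   import xarray as xr
#   fs = s3fs.S3FileSystem(anon=True)
#   buf = read_whole_object(fs, "noaa-goes16/<object key>")
#   ds = xr.open_dataset(buf, engine="h5netcdf")
```

Once the bytes are in memory, every HDF5 metadata lookup is a local seek rather than a network round trip, which is the effect the measurement above points to.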

**Scale.** CONUS ~0.3–1.5 MB/file; DSRF full-disk ~50 MB/file; the full v1 corpus
is ≈ 3.67 M files / ~28 TB of transient download → a **tiny** parquet output
(one float per station per scan). ACMC for one station over ~2 years is ≈ 200 GB
download / ~5 h on home internet. Because the download is transient and the
output is tiny, the fleet model is **near-data compute in-region** — run the
backfill on a VM in the same cloud region as the bucket (free egress), keep only
the parquet.
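
The figures above can be cross-checked with a few lines of arithmetic. The sketch assumes decimal units (1 TB = 1e12 bytes) and a 5-minute CONUS scan cadence; the cadence is an assumption that is not stated in this document.

```python
# Figures quoted in the text above.
CORPUS_FILES = 3.67e6
CORPUS_BYTES = 28e12   # ~28 TB, decimal units assumed
ACMC_BYTES = 200e9     # ~200 GB for one station over ~2 years
ACMC_HOURS = 5.0

# Assumption (not from the text): one CONUS scan every 5 minutes.
SCANS_PER_DAY = 24 * 60 // 5

mean_file_mb = CORPUS_BYTES / CORPUS_FILES / 1e6
acmc_files = SCANS_PER_DAY * 365 * 2
acmc_mean_file_mb = ACMC_BYTES / acmc_files / 1e6
implied_mbit_s = ACMC_BYTES * 8 / (ACMC_HOURS * 3600) / 1e6

print(f"mean corpus file size: ~{mean_file_mb:.1f} MB")
print(f"ACMC files in 2 years: {acmc_files}")
print(f"mean ACMC file size:   ~{acmc_mean_file_mb:.2f} MB")
print(f"implied link speed:    ~{implied_mbit_s:.0f} Mbit/s")
```

Under these assumptions the mean ACMC file size falls inside the quoted 0.3–1.5 MB range, and the 5 h figure corresponds to a link of roughly 90 Mbit/s.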

### Concurrency: `max_workers` + the S3 rate cap are probe-DERIVED

`max_workers` and the S3 rate-limiter cap are **constants derived from the
satellite rate-limit probe** (mirroring how `forecasts.md`'s
`NOMADS_CONCURRENCY_CAP=4` is documented as empirically derived). They are NOT
guessed and NOT a bare "UNTUNED" caveat: the shipped `_GOES_S3_RATE_HZ` and
`_DEFAULT_MAX_WORKERS` in `satellite/_backfill.py` are floored at the values the
probe records.

Run the probe to (re-)measure the anonymous-throttle / diminishing-returns knee:

```bash
python -m mostlyright.weather.satellite probe --mirror aws --out .planning/research
```

It measures ListObjectsV2 latency, single-file throughput, and a 1/4/8/16/32
concurrency sweep, then writes a findings artifact + a satellite section into
`SOURCE-LIMITS.md` — the place the shipped constants cite as provenance. A
provenance-lock test asserts the shipped constants are floored at / match those
recorded values, so the probe RESULT governs the default, not a doc-only note.
Until you run it in-region, the constants stay conservative
(`_GOES_S3_RATE_HZ=20.0`, `_DEFAULT_MAX_WORKERS=8`).
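
One way to turn a concurrency sweep into a diminishing-returns knee is sketched below. This is not the probe's actual algorithm: `pick_knee`, the 10% gain threshold, and the sample numbers are all assumptions, and the sample sweep is synthetic rather than a measured result.

```python
def pick_knee(sweep: dict[int, float], min_gain: float = 0.10) -> int:
    """Return the largest worker count that still improved throughput by at
    least `min_gain` (fractional) over the previous step of the sweep."""
    if not sweep:
        raise ValueError("sweep must contain at least one measurement")
    levels = sorted(sweep)
    best = levels[0]
    for prev, cur in zip(levels, levels[1:]):
        if sweep[prev] <= 0 or (sweep[cur] - sweep[prev]) / sweep[prev] < min_gain:
            break
        best = cur
    return best


# Synthetic, illustrative numbers only (files/s per worker count), NOT probe output.
example_sweep = {1: 1.0, 4: 3.6, 8: 3.8, 16: 3.7, 32: 3.5}
knee = pick_knee(example_sweep)
```

A rule like this stops at the first step whose marginal gain falls under the threshold, so a later noisy uptick in the sweep cannot push the default higher.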

## Deferred: paid adapter

This phase ships the **free local tier only**. A future paid adapter
(`strategy="hosted"`) will read a pre-extracted catalog — but it SHARES the
`noaa_goes` source identity (it is byte-identical to live self-extraction) and is
distinguished only by the informational `delivery` lineage column, so a model
trained on adapter data reconciles with live self-extraction (no source drift).

## See also

- [`docs/forecasts.md`](forecasts.md) — the NWP forecast path (the load-bearing
Tmax/Tmin signal).
- [`docs/forecast-sources.md`](forecast-sources.md) — forecast source catalog.