
Improve database startup data population#25

Merged
cchwala merged 8 commits into main from database_startup_data_population
Mar 10, 2026
Conversation


@cchwala cchwala commented Mar 10, 2026

PR summary:

  • Decouple archive loading from DB init: new archive_loader service runs after DB is healthy
  • Real-time data flows within ~10 s; archive history visible after ~90 s
  • Add DB healthcheck; parser waits for service_healthy before starting
  • init_archive_data.sh made idempotent (ON CONFLICT DO NOTHING, CREATE INDEX IF NOT EXISTS)
  • Stats thread retries on connection failure; tests updated for refactored generate_archive_data
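
A minimal compose sketch of the healthcheck wiring described above (service names are from this PR; the exact healthcheck options and variable names are assumptions):

```yaml
services:
  database:
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER:-postgres}"]
      interval: 5s
      timeout: 3s
      retries: 10
  parser:
    depends_on:
      database:
        condition: service_healthy      # wait until the DB accepts connections
  archive_loader:
    depends_on:
      archive_generator:
        condition: service_completed_successfully  # CSVs must exist first
      database:
        condition: service_healthy
```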

cchwala added 5 commits March 10, 2026 13:46
- generate_archive.py: add --days, --interval-seconds CLI args and
  ARCHIVE_DAYS, ARCHIVE_INTERVAL_SECONDS env var support; timestamps
  computed at runtime so archive is always relative to 'now'
- Dockerfile: copy generate_archive.py into the simulator image
- docker-compose.yml: add archive_generator service that runs before
  database starts (service_completed_successfully dependency); switch
  archive data from bind-mount to named volume
- config.yml: generation frequency 30s->10s, 1 timestamp per file
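
The runtime-relative timestamp logic can be sketched like this (a hypothetical illustration of the approach; the env var names match the commit, the function itself and its defaults are assumptions):

```python
import os
from datetime import datetime, timedelta, timezone

def archive_timestamps(days=None, interval_seconds=None):
    """Compute archive timestamps at runtime so the archive always ends 'now'.

    CLI-style arguments win; ARCHIVE_DAYS / ARCHIVE_INTERVAL_SECONDS env vars
    are the fallback, as described in the commit.
    """
    days = int(os.environ.get("ARCHIVE_DAYS", 1)) if days is None else days
    interval_seconds = (int(os.environ.get("ARCHIVE_INTERVAL_SECONDS", 10))
                        if interval_seconds is None else interval_seconds)
    now = datetime.now(timezone.utc)
    n = days * 86400 // interval_seconds   # number of timestamps in the window
    # oldest first, newest is one interval before 'now'
    return [now - timedelta(seconds=i * interval_seconds) for i in range(n, 0, -1)]
```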
update_cml_stats was doing a full table scan for all CMLs on every
incoming 10-second file, causing ~4 minute processing delays.

- db_writer.py: remove update_cml_stats from write_rawdata; add
  refresh_stats() method intended for background use only
- main.py: run stats refresh in a dedicated daemon thread on a
  configurable timer (STATS_REFRESH_INTERVAL env var, default 60s)
  with its own DB connection so inserts are never blocked
- init.sql: add idx_cml_data_cml_id index on cml_data(cml_id, time DESC)
  to speed up per-CML stats queries as data grows
Previously generate_archive.py called isel()+to_dataframe() once per
archive timestamp (86,400 timestamps for 10 days at 10 s), and wrote
the output through gzip compression. Combined: ~20 min for 10 days.

- Pre-cache all unique NetCDF time slices (max 720) into numpy arrays
  upfront; use np.tile/repeat/concatenate to build batches of 5000
  timestamps at once -- eliminates repeated xarray overhead
- Drop gzip entirely: files live only in a Docker volume, compression
  was unnecessary and dominated write time
- Output filenames change from .csv.gz to .csv
- Update test to reflect new function signature and plain-file output
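
The batching idea can be sketched with tiny sizes (illustrative names and shapes; the real code caches up to 720 unique NetCDF slices for 364 CMLs and builds batches of 5000 timestamps):

```python
import numpy as np

# Pre-cache each unique NetCDF time slice once, as a numpy array
# (stand-in for the cached isel()+to_numpy results).
n_cmls = 3
cached_slices = [np.arange(n_cmls, dtype=float) + k for k in range(4)]

def build_batch(slice_indices):
    """One output row per (timestamp, CML), assembled with vectorized numpy
    ops instead of one isel()+to_dataframe() call per timestamp."""
    values = np.concatenate([cached_slices[i] for i in slice_indices])
    # repeat each timestamp index n_cmls times to align with the values column
    ts_col = np.repeat(np.asarray(slice_indices), n_cmls)
    return np.column_stack([ts_col, values])

batch = build_batch([0, 2, 2])   # 3 timestamps x 3 CMLs -> 9 rows
```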
- Move index creation from init.sql to after \COPY in init_archive_data.sh:
  the index, created on the empty table at init time, had to be maintained
  row-by-row during the bulk load, doubling COPY time; building it after
  the load is a single sequential scan
- Drop update_cml_stats call from init script: it scanned all rows for
  364 CMLs before the DB was ready, adding 1-2 min; the parser's
  background stats thread handles this immediately on startup instead
- Result: 2-day 10s archive init time reduced from ~5 min to ~98 s
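
In init-script terms the reordering looks roughly like this (table and index names are from the commits; the CSV path is a placeholder):

```sql
-- before: CREATE INDEX ran in init.sql, so every COPY'd row updated the index
-- after: bulk-load first, then build the index in one sequential scan
\COPY cml_data FROM '/archive_data/archive.csv' WITH (FORMAT csv, HEADER true);
CREATE INDEX IF NOT EXISTS idx_cml_data_cml_id ON cml_data (cml_id, time DESC);
```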
The background stats thread previously waited a full STATS_REFRESH_INTERVAL
(60s) before its first run, so Grafana dashboards backed by cml_stats showed
no data for up to a minute after the parser came up.

Run refresh_stats() once immediately after connecting, then fall into the
timed loop as before.
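
A minimal sketch of the fixed loop (the real thread in parser/main.py uses its own DB connection; the callback and stop event here are stand-ins):

```python
import threading
import time

def stats_refresh_loop(refresh_stats, interval_s, stop):
    """Refresh once immediately, then on a timer; stop is a threading.Event."""
    refresh_stats()                    # first run right away, no initial wait
    while not stop.wait(interval_s):   # wait() returns True once stop is set
        refresh_stats()

# usage sketch with a tiny interval
calls = []
stop = threading.Event()
t = threading.Thread(target=stats_refresh_loop,
                     args=(lambda: calls.append(time.monotonic()), 0.05, stop),
                     daemon=True)
t.start()
time.sleep(0.12)
stop.set()
t.join()
```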

cchwala commented Mar 10, 2026

Current timings during startup:

Startup timing (1 day archive, 10 s interval, 6.3 M rows)

| Phase | Duration |
| --- | --- |
| database healthy (schema only, no data) | ~6 s |
| parser ready + first real-time data flowing | ~10 s from T0 |
| archive_generator — generate CSVs (6.3 M rows, 336 MB) | 43 s |
| archive_loader — COPY 6.3 M rows into DB | 33 s |
| archive_loader — build index | 10 s |
| archive_loader total | ~43 s (runs in parallel with real-time ingestion) |
| Total cold-start → Grafana shows real-time data | ~10 s |
| Total cold-start → Grafana shows full archive history | ~90 s |

Real-time data flows immediately once the DB is healthy (~10 s).
Archive history loads in the background without blocking or delaying the parser.

cchwala added 3 commits March 10, 2026 19:43
- Refactor init_archive_data.sh to work both as a PostgreSQL init-phase
  script (Unix socket, no PGHOST) and as a standalone post-startup
  container (PGHOST set for TCP connection)
- Credentials resolve via PGUSER/PGDATABASE env vars with fallback to
  POSTGRES_USER/POSTGRES_DB so the script works in both contexts
- Replace bare \COPY cml_metadata with a temp-table approach:
    COPY → tmp, INSERT INTO cml_metadata ON CONFLICT DO NOTHING
  so the script never aborts when the parser has already inserted
  metadata from a real-time upload before the loader runs
- Change CREATE INDEX to CREATE INDEX IF NOT EXISTS for the same reason
- Update comment in init.sql to reference archive_loader service
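
The temp-table pattern from the commit, roughly (the CSV path is a placeholder):

```sql
-- COPY into a scratch table, then insert-ignore into the real one, so a
-- metadata row the parser already inserted never aborts the loader
CREATE TEMP TABLE tmp_cml_metadata (LIKE cml_metadata INCLUDING DEFAULTS);
\COPY tmp_cml_metadata FROM '/archive_data/cml_metadata.csv' WITH (FORMAT csv, HEADER true);
INSERT INTO cml_metadata SELECT * FROM tmp_cml_metadata ON CONFLICT DO NOTHING;
```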
Previously the archive data was bulk-loaded inside PostgreSQL's init
phase (docker-entrypoint-initdb.d), which blocked all TCP connections
for 1-3 minutes. Services that depended on 'database' would start,
fail to connect, and the parser's stats thread would exit permanently.

New startup sequence:
  archive_generator → generates CSVs (runs before DB starts)
  database          → schema-only init, healthy in ~6 s
  parser            → starts immediately when DB is healthy (~10 s T0)
  archive_loader    → loads CSVs after DB is healthy, in background

Changes:
- docker-compose.yml:
  - Add healthcheck to database (pg_isready, 5 s interval)
  - Add archive_loader service: runs init_archive_data.sh as a
    standalone container after archive_generator completes and DB
    is healthy; does not block any other service
  - Remove archive volume mounts from database service (no longer
    loaded during DB init)
  - Change parser depends_on to condition: service_healthy so it
    only starts when the DB is actually accepting connections
- parser/main.py:
  - Stats thread retries DB connection every 5 s instead of giving
    up after 3 attempts; prevents permanent silent failure if DB is
    momentarily unreachable

Result: real-time data flows within ~10 s of docker compose up;
archive history appears in Grafana ~90 s later without any gap or
delay in real-time ingestion.
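
The retry change can be sketched like this (an illustration of the parser/main.py fix; connect() is any callable that raises on failure and returns a connection on success):

```python
import time

def connect_with_retry(connect, retry_interval_s=5.0):
    """Retry the DB connection indefinitely instead of giving up after 3
    attempts, so a momentarily unreachable DB no longer kills the thread."""
    while True:
        try:
            return connect()
        except Exception:              # real code would catch the driver's error
            time.sleep(retry_interval_s)

# usage sketch: fail twice, then succeed
attempts = {"n": 0}
def flaky_connect():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("db not ready yet")
    return "connection"

conn = connect_with_retry(flaky_connect, retry_interval_s=0.001)
```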
…rchive_data

After the numpy-cached rewrite:
- generate_archive_data() requires 4 positional args (archive_days,
  output_dir, netcdf_file, interval_seconds) -- update both tests to
  pass them explicitly
- The generation path no longer calls generator.generate_data(); it
  accesses generator.dataset.isel(), _get_netcdf_index_for_timestamp(),
  and original_time_points directly -- mock those instead
- pandas to_csv() calls pathlib.Path.is_dir() internally; patch it to
  return True so the mock_open approach still works
- Add numpy import (used for mock attribute setup)
@cchwala cchwala merged commit 5c081f8 into main Mar 10, 2026
5 checks passed

codecov bot commented Mar 10, 2026

Codecov Report

❌ Patch coverage is 53.26087% with 43 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.17%. Comparing base (019c131) to head (64649c1).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| parser/main.py | 0.00% | 27 Missing ⚠️ |
| parser/db_writer.py | 6.66% | 14 Missing ⚠️ |
| mno_data_source_simulator/generate_archive.py | 96.00% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #25      +/-   ##
==========================================
- Coverage   81.21%   72.17%   -9.04%     
==========================================
  Files          16       22       +6     
  Lines         958     1980    +1022     
==========================================
+ Hits          778     1429     +651     
- Misses        180      551     +371     
| Flag | Coverage Δ |
| --- | --- |
| mno_simulator | 88.16% <96.00%> (?) |
| parser | 77.91% <2.38%> (-3.30%) ⬇️ |
| webserver | 44.73% <ø> (?) |
