
Improve database startup data population#25

Merged
cchwala merged 8 commits into main from database_startup_data_population
Mar 10, 2026
Conversation


@cchwala cchwala commented Mar 10, 2026

PR summary:

  • Decouple archive loading from DB init: new archive_loader service runs after DB is healthy
  • Real-time data flows within ~10 s; archive history visible after ~90 s
  • Add DB healthcheck; parser waits for service_healthy before starting
  • init_archive_data.sh made idempotent (ON CONFLICT DO NOTHING, CREATE INDEX IF NOT EXISTS)
  • Stats thread retries on connection failure; tests updated for refactored generate_archive_data
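
A minimal compose sketch of the healthcheck wiring described above (service names are from this PR; the exact healthcheck options and variable names are assumptions):

```yaml
services:
  database:
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER:-postgres}"]
      interval: 5s
      timeout: 3s
      retries: 10
  parser:
    depends_on:
      database:
        condition: service_healthy      # wait until the DB accepts connections
  archive_loader:
    depends_on:
      archive_generator:
        condition: service_completed_successfully  # CSVs must exist first
      database:
        condition: service_healthy
```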

cchwala added 5 commits March 10, 2026 13:46
- generate_archive.py: add --days, --interval-seconds CLI args and
  ARCHIVE_DAYS, ARCHIVE_INTERVAL_SECONDS env var support; timestamps
  computed at runtime so archive is always relative to 'now'
- Dockerfile: copy generate_archive.py into the simulator image
- docker-compose.yml: add archive_generator service that runs before
  database starts (service_completed_successfully dependency); switch
  archive data from bind-mount to named volume
- config.yml: generation frequency 30s->10s, 1 timestamp per file
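
The runtime-relative timestamp logic can be sketched like this (a hypothetical illustration of the approach; the env var names match the commit, the function itself and its defaults are assumptions):

```python
import os
from datetime import datetime, timedelta, timezone

def archive_timestamps(days=None, interval_seconds=None):
    """Compute archive timestamps at runtime so the archive always ends 'now'.

    CLI-style arguments win; ARCHIVE_DAYS / ARCHIVE_INTERVAL_SECONDS env vars
    are the fallback, as described in the commit.
    """
    days = int(os.environ.get("ARCHIVE_DAYS", 1)) if days is None else days
    interval_seconds = (int(os.environ.get("ARCHIVE_INTERVAL_SECONDS", 10))
                        if interval_seconds is None else interval_seconds)
    now = datetime.now(timezone.utc)
    n = days * 86400 // interval_seconds   # number of timestamps in the window
    # oldest first, newest is one interval before 'now'
    return [now - timedelta(seconds=i * interval_seconds) for i in range(n, 0, -1)]
```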
update_cml_stats was doing a full table scan for all CMLs on every
incoming 10-second file, causing ~4 minute processing delays.

- db_writer.py: remove update_cml_stats from write_rawdata; add
  refresh_stats() method intended for background use only
- main.py: run stats refresh in a dedicated daemon thread on a
  configurable timer (STATS_REFRESH_INTERVAL env var, default 60s)
  with its own DB connection so inserts are never blocked
- init.sql: add idx_cml_data_cml_id index on cml_data(cml_id, time DESC)
  to speed up per-CML stats queries as data grows
Previously generate_archive.py called isel()+to_dataframe() once per
archive timestamp (86,400 timestamps for 10 days at 10 s), and wrote
the output through gzip compression. Combined: ~20 min for 10 days.

- Pre-cache all unique NetCDF time slices (max 720) into numpy arrays
  upfront; use np.tile/repeat/concatenate to build batches of 5000
  timestamps at once -- eliminates repeated xarray overhead
- Drop gzip entirely: files live only in a Docker volume, compression
  was unnecessary and dominated write time
- Output filenames change from .csv.gz to .csv
- Update test to reflect new function signature and plain-file output
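
The batching idea can be sketched with tiny sizes (illustrative names and shapes; the real code caches up to 720 unique NetCDF slices for 364 CMLs and builds batches of 5000 timestamps):

```python
import numpy as np

# Pre-cache each unique NetCDF time slice once, as a numpy array
# (stand-in for the cached isel()+to_numpy results).
n_cmls = 3
cached_slices = [np.arange(n_cmls, dtype=float) + k for k in range(4)]

def build_batch(slice_indices):
    """One output row per (timestamp, CML), assembled with vectorized numpy
    ops instead of one isel()+to_dataframe() call per timestamp."""
    values = np.concatenate([cached_slices[i] for i in slice_indices])
    # repeat each timestamp index n_cmls times to align with the values column
    ts_col = np.repeat(np.asarray(slice_indices), n_cmls)
    return np.column_stack([ts_col, values])

batch = build_batch([0, 2, 2])   # 3 timestamps x 3 CMLs -> 9 rows
```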
- Move index creation from init.sql to after \COPY in init_archive_data.sh:
  the index, created on the empty table at init time, had to be maintained
  row-by-row during the bulk load, doubling COPY time; building it after
  the load is a single sequential scan
- Drop update_cml_stats call from init script: it scanned all rows for
  364 CMLs before the DB was ready, adding 1-2 min; the parser's
  background stats thread handles this immediately on startup instead
- Result: 2-day 10s archive init time reduced from ~5 min to ~98 s
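
In init-script terms the reordering looks roughly like this (table and index names are from the commits; the CSV path is a placeholder):

```sql
-- before: CREATE INDEX ran in init.sql, so every COPY'd row updated the index
-- after: bulk-load first, then build the index in one sequential scan
\COPY cml_data FROM '/archive_data/archive.csv' WITH (FORMAT csv, HEADER true);
CREATE INDEX IF NOT EXISTS idx_cml_data_cml_id ON cml_data (cml_id, time DESC);
```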
The background stats thread previously waited a full STATS_REFRESH_INTERVAL
(60s) before its first run, so Grafana dashboards backed by cml_stats showed
no data for up to a minute after the parser came up.

Run refresh_stats() once immediately after connecting, then fall into the
timed loop as before.
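
A minimal sketch of the fixed loop (the real thread in parser/main.py uses its own DB connection; the callback and stop event here are stand-ins):

```python
import threading
import time

def stats_refresh_loop(refresh_stats, interval_s, stop):
    """Refresh once immediately, then on a timer; stop is a threading.Event."""
    refresh_stats()                    # first run right away, no initial wait
    while not stop.wait(interval_s):   # wait() returns True once stop is set
        refresh_stats()

# usage sketch with a tiny interval
calls = []
stop = threading.Event()
t = threading.Thread(target=stats_refresh_loop,
                     args=(lambda: calls.append(time.monotonic()), 0.05, stop),
                     daemon=True)
t.start()
time.sleep(0.12)
stop.set()
t.join()
```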

cchwala commented Mar 10, 2026

Current timings during startup:

Startup timing (1 day archive, 10 s interval, 6.3 M rows)

| Phase | Duration |
| --- | --- |
| database healthy (schema only, no data) | ~6 s |
| parser ready + first real-time data flowing | ~10 s from T0 |
| archive_generator — generate CSVs (6.3 M rows, 336 MB) | 43 s |
| archive_loader — COPY 6.3 M rows into DB | 33 s |
| archive_loader — build index | 10 s |
| archive_loader total | ~43 s (runs in parallel with real-time ingestion) |
| Total cold-start → Grafana shows real-time data | ~10 s |
| Total cold-start → Grafana shows full archive history | ~90 s |

Real-time data flows immediately once the DB is healthy (~10 s).
Archive history loads in the background without blocking or delaying the parser.

cchwala added 3 commits March 10, 2026 19:43
- Refactor init_archive_data.sh to work both as a PostgreSQL init-phase
  script (Unix socket, no PGHOST) and as a standalone post-startup
  container (PGHOST set for TCP connection)
- Credentials resolve via PGUSER/PGDATABASE env vars with fallback to
  POSTGRES_USER/POSTGRES_DB so the script works in both contexts
- Replace bare \COPY cml_metadata with a temp-table approach:
    COPY → tmp, INSERT INTO cml_metadata ON CONFLICT DO NOTHING
  so the script never aborts when the parser has already inserted
  metadata from a real-time upload before the loader runs
- Change CREATE INDEX to CREATE INDEX IF NOT EXISTS for the same reason
- Update comment in init.sql to reference archive_loader service
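
The temp-table pattern from the commit, roughly (the CSV path is a placeholder):

```sql
-- COPY into a scratch table, then insert-ignore into the real one, so a
-- metadata row the parser already inserted never aborts the loader
CREATE TEMP TABLE tmp_cml_metadata (LIKE cml_metadata INCLUDING DEFAULTS);
\COPY tmp_cml_metadata FROM '/archive_data/cml_metadata.csv' WITH (FORMAT csv, HEADER true);
INSERT INTO cml_metadata SELECT * FROM tmp_cml_metadata ON CONFLICT DO NOTHING;
```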
Previously the archive data was bulk-loaded inside PostgreSQL's init
phase (docker-entrypoint-initdb.d), which blocked all TCP connections
for 1-3 minutes. Services that depended on 'database' would start,
fail to connect, and the parser's stats thread would exit permanently.

New startup sequence:
  archive_generator → generates CSVs (runs before DB starts)
  database          → schema-only init, healthy in ~6 s
  parser            → starts immediately when DB is healthy (~10 s T0)
  archive_loader    → loads CSVs after DB is healthy, in background

Changes:
- docker-compose.yml:
  - Add healthcheck to database (pg_isready, 5 s interval)
  - Add archive_loader service: runs init_archive_data.sh as a
    standalone container after archive_generator completes and DB
    is healthy; does not block any other service
  - Remove archive volume mounts from database service (no longer
    loaded during DB init)
  - Change parser depends_on to condition: service_healthy so it
    only starts when the DB is actually accepting connections
- parser/main.py:
  - Stats thread retries DB connection every 5 s instead of giving
    up after 3 attempts; prevents permanent silent failure if DB is
    momentarily unreachable

Result: real-time data flows within ~10 s of docker compose up;
archive history appears in Grafana ~90 s later without any gap or
delay in real-time ingestion.
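
The retry change can be sketched like this (an illustration of the parser/main.py fix; connect() is any callable that raises on failure and returns a connection on success):

```python
import time

def connect_with_retry(connect, retry_interval_s=5.0):
    """Retry the DB connection indefinitely instead of giving up after 3
    attempts, so a momentarily unreachable DB no longer kills the thread."""
    while True:
        try:
            return connect()
        except Exception:              # real code would catch the driver's error
            time.sleep(retry_interval_s)

# usage sketch: fail twice, then succeed
attempts = {"n": 0}
def flaky_connect():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("db not ready yet")
    return "connection"

conn = connect_with_retry(flaky_connect, retry_interval_s=0.001)
```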
…rchive_data

After the numpy-cached rewrite:
- generate_archive_data() requires 4 positional args (archive_days,
  output_dir, netcdf_file, interval_seconds) -- update both tests to
  pass them explicitly
- The generation path no longer calls generator.generate_data(); it
  accesses generator.dataset.isel(), _get_netcdf_index_for_timestamp(),
  and original_time_points directly -- mock those instead
- pandas to_csv() calls pathlib.Path.is_dir() internally; patch it to
  return True so the mock_open approach still works
- Add numpy import (used for mock attribute setup)
@cchwala cchwala merged commit 5c081f8 into main Mar 10, 2026
5 checks passed

codecov bot commented Mar 10, 2026

Codecov Report

❌ Patch coverage is 53.26087% with 43 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.17%. Comparing base (019c131) to head (64649c1).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| parser/main.py | 0.00% | 27 Missing ⚠️ |
| parser/db_writer.py | 6.66% | 14 Missing ⚠️ |
| mno_data_source_simulator/generate_archive.py | 96.00% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #25      +/-   ##
==========================================
- Coverage   81.21%   72.17%   -9.04%     
==========================================
  Files          16       22       +6     
  Lines         958     1980    +1022     
==========================================
+ Hits          778     1429     +651     
- Misses        180      551     +371     
| Flag | Coverage Δ |
| --- | --- |
| mno_simulator | 88.16% <96.00%> (?) |
| parser | 77.91% <2.38%> (-3.30%) ⬇️ |
| webserver | 44.73% <ø> (?) |
