Merged
- generate_archive.py: add --days and --interval-seconds CLI args with ARCHIVE_DAYS and ARCHIVE_INTERVAL_SECONDS env var support; timestamps are computed at runtime so the archive is always relative to "now"
- Dockerfile: copy generate_archive.py into the simulator image
- docker-compose.yml: add an archive_generator service that runs before the database starts (service_completed_successfully dependency); switch archive data from a bind mount to a named volume
- config.yml: generation frequency 30 s → 10 s, 1 timestamp per file
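The CLI/env-var layering described above can be sketched as follows. The flag and env var names come from the commit; the default values and the `parse_args` helper are assumptions for illustration.

```python
import argparse
import os

def parse_args(argv=None):
    # CLI flags win; the ARCHIVE_* env vars act as defaults (defaults of
    # 1 day / 10 s are assumed, not taken from the repo).
    parser = argparse.ArgumentParser(description="Generate archive CSV files")
    parser.add_argument(
        "--days",
        type=int,
        default=int(os.environ.get("ARCHIVE_DAYS", 1)),
        help="days of history to generate (env: ARCHIVE_DAYS)",
    )
    parser.add_argument(
        "--interval-seconds",
        type=int,
        default=int(os.environ.get("ARCHIVE_INTERVAL_SECONDS", 10)),
        help="spacing between archive timestamps (env: ARCHIVE_INTERVAL_SECONDS)",
    )
    return parser.parse_args(argv)

args = parse_args(["--days", "2"])
```

Reading the env vars inside the parser keeps the precedence rule in one place: an explicit flag always overrides the environment.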
update_cml_stats was doing a full table scan across all CMLs on every incoming 10-second file, causing ~4-minute processing delays.
- db_writer.py: remove update_cml_stats from write_rawdata; add a refresh_stats() method intended for background use only
- main.py: run the stats refresh in a dedicated daemon thread on a configurable timer (STATS_REFRESH_INTERVAL env var, default 60 s) with its own DB connection so inserts are never blocked
- init.sql: add an idx_cml_data_cml_id index on cml_data(cml_id, time DESC) to speed up per-CML stats queries as data grows
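The background-thread shape described above might look like the sketch below. STATS_REFRESH_INTERVAL is from the commit; the loop function and the stand-in callables are illustrative, not the repo's actual code.

```python
import os
import threading

def stats_refresh_loop(connect, refresh_stats, stop_event, interval):
    # A dedicated connection: the insert path in write_rawdata never shares
    # (or waits on) this one, so a slow stats scan cannot block ingestion.
    conn = connect()
    while not stop_event.wait(interval):
        refresh_stats(conn)

interval = int(os.environ.get("STATS_REFRESH_INTERVAL", "60"))
stop = threading.Event()
worker = threading.Thread(
    target=stats_refresh_loop,
    args=(lambda: None, lambda conn: None, stop, interval),  # stand-in callables
    daemon=True,  # the stats thread must never keep the process alive
)
```

Using `Event.wait(interval)` instead of `time.sleep` lets a shutdown signal interrupt the wait immediately rather than after up to a full interval.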
Previously generate_archive.py called isel() + to_dataframe() once per archive timestamp (up to 864,000 calls for 10 days at 10 s) and wrote the output through gzip compression; combined, ~20 min for 10 days.
- Pre-cache all unique NetCDF time slices (at most 720) into numpy arrays up front; use np.tile/repeat/concatenate to build batches of 5,000 timestamps at once, eliminating the repeated xarray overhead
- Drop gzip entirely: the files live only in a Docker volume, so compression was unnecessary and dominated write time
- Output filenames change from .csv.gz to .csv
- Update the test to reflect the new function signature and plain-file output
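The batching idea can be shown in miniature. The function name and array shapes below are assumptions; the point is that each batch costs one `np.concatenate` and one `np.repeat` rather than one `isel()`/`to_dataframe()` per timestamp.

```python
import numpy as np

def build_batch(cached_slices, slice_indices, timestamps, n_links):
    # cached_slices: one numpy array of per-link values per unique NetCDF
    # time point (at most 720 of them), extracted from xarray up front.
    # slice_indices: which cached slice each archive timestamp maps to.
    values = np.concatenate([cached_slices[i] for i in slice_indices])
    # Repeat each timestamp once per link so rows align with the values.
    times = np.repeat(timestamps, n_links)
    return times, values
```

Because the cached slices are plain numpy arrays, the inner loop does no xarray indexing at all; the per-timestamp cost collapses to a list lookup.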
- Move index creation from init.sql to after \COPY in init_archive_data.sh: an index created on the still-empty table had to be maintained row by row during the bulk load, doubling COPY time; building it after the load is a single sequential scan
- Drop the update_cml_stats call from the init script: it scanned all rows for 364 CMLs before the DB was ready, adding 1-2 min; the parser's background stats thread now handles this immediately on startup instead
- Result: init time for a 2-day, 10 s archive reduced from ~5 min to ~98 s
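The load-then-index ordering, as a psql sketch. The table and index names appear in the commits; the CSV path and column layout are assumptions.

```sql
-- Bulk load first: COPY into an unindexed table is the fast path.
\copy cml_data FROM '/archive/cml_data.csv' WITH (FORMAT csv, HEADER true)

-- Then build the index in one pass over the loaded rows.
CREATE INDEX idx_cml_data_cml_id
    ON cml_data (cml_id, time DESC);
```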
The background stats thread previously waited a full STATS_REFRESH_INTERVAL (60s) before its first run, so Grafana dashboards backed by cml_stats showed no data for up to a minute after the parser came up. Run refresh_stats() once immediately after connecting, then fall into the timed loop as before.
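The fix is a one-line reorder in the loop body: call refresh_stats() once before entering the timed wait. A sketch with assumed names (the actual main.py wiring differs):

```python
import threading

def stats_refresh_loop(connect, refresh_stats, stop_event, interval):
    conn = connect()
    refresh_stats(conn)  # first refresh runs immediately, not after `interval`
    while not stop_event.wait(interval):
        refresh_stats(conn)

# With the event pre-set, the loop runs exactly the one immediate refresh.
stop = threading.Event()
stop.set()
stats_refresh_loop(lambda: None, lambda conn: None, stop, 60)
```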
Current startup timing (1-day archive, 10 s interval, 6.3 M rows):
Real-time data flows immediately once the DB is healthy (~10 s).
- Refactor init_archive_data.sh to work both as a PostgreSQL init-phase script (Unix socket, no PGHOST) and as a standalone post-startup container (PGHOST set for a TCP connection)
- Credentials resolve via the PGUSER/PGDATABASE env vars, falling back to POSTGRES_USER/POSTGRES_DB, so the script works in both contexts
- Replace the bare \COPY cml_metadata with a temp-table approach (COPY → tmp, then INSERT INTO cml_metadata ... ON CONFLICT DO NOTHING) so the script never aborts when the parser has already inserted metadata from a real-time upload before the loader runs
- Change CREATE INDEX to CREATE INDEX IF NOT EXISTS for the same reason
- Update the comment in init.sql to reference the archive_loader service
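The temp-table pattern from the second bullet, sketched in SQL. The table name and ON CONFLICT clause are from the commit; the CSV path and the use of LIKE for the staging table are assumptions.

```sql
-- Stage into a session-local temp table, then upsert: duplicate keys from an
-- earlier real-time insert are skipped instead of aborting the whole script.
CREATE TEMP TABLE tmp_cml_metadata (LIKE cml_metadata INCLUDING DEFAULTS);

\copy tmp_cml_metadata FROM '/archive/cml_metadata.csv' WITH (FORMAT csv, HEADER true)

INSERT INTO cml_metadata
SELECT * FROM tmp_cml_metadata
ON CONFLICT DO NOTHING;
```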
Previously the archive data was bulk-loaded inside PostgreSQL's init
phase (docker-entrypoint-initdb.d), which blocked all TCP connections
for 1-3 minutes. Services that depended on 'database' would start,
fail to connect, and the parser's stats thread would exit permanently.
New startup sequence:
archive_generator → generates CSVs (runs before DB starts)
database → schema-only init, healthy in ~6 s
parser → starts immediately when DB is healthy (~10 s T0)
archive_loader → loads CSVs after DB is healthy, in background
Changes:
- docker-compose.yml:
- Add healthcheck to database (pg_isready, 5 s interval)
- Add archive_loader service: runs init_archive_data.sh as a
standalone container after archive_generator completes and DB
is healthy; does not block any other service
- Remove archive volume mounts from database service (no longer
loaded during DB init)
- Change parser depends_on to condition: service_healthy so it
only starts when the DB is actually accepting connections
- parser/main.py:
- Stats thread retries DB connection every 5 s instead of giving
up after 3 attempts; prevents permanent silent failure if DB is
momentarily unreachable
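The retry change in parser/main.py amounts to an unbounded loop with a stop signal. A sketch with assumed names:

```python
import threading
import time

def connect_with_retry(connect, stop_event, retry_seconds=5):
    # Retry every retry_seconds indefinitely instead of giving up after a
    # fixed number of attempts; a briefly unreachable DB no longer kills
    # the stats thread permanently.
    while not stop_event.is_set():
        try:
            return connect()
        except Exception:
            time.sleep(retry_seconds)
    return None

stop = threading.Event()  # set on shutdown so the thread can exit cleanly
```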
Result: real-time data flows within ~10 s of docker compose up;
archive history appears in Grafana ~90 s later without any gap or
delay in real-time ingestion.
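The compose wiring described above might look like the fragment below. The service names, healthcheck command, and dependency conditions are from the text; the interval value and pg_isready arguments are assumptions.

```yaml
services:
  database:
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U $${POSTGRES_USER}"]
      interval: 5s
  parser:
    depends_on:
      database:
        condition: service_healthy
  archive_loader:
    depends_on:
      database:
        condition: service_healthy
      archive_generator:
        condition: service_completed_successfully
```

Neither the parser nor any other service depends on archive_loader, which is what keeps the bulk load off the critical startup path.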
…rchive_data
After the numpy-cached rewrite:
- generate_archive_data() now requires 4 positional args (archive_days, output_dir, netcdf_file, interval_seconds); update both tests to pass them explicitly
- The generation path no longer calls generator.generate_data(); it accesses generator.dataset.isel(), _get_netcdf_index_for_timestamp(), and original_time_points directly, so mock those instead
- pandas' to_csv() calls pathlib.Path.is_dir() internally; patch it to return True so the mock_open approach still works
- Add a numpy import (used for mock attribute setup)
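The Path.is_dir() patch point is the non-obvious part of that test fix. A minimal sketch (the helper name is illustrative, not from the repo):

```python
from pathlib import Path
from unittest.mock import patch

def run_with_fake_dir(fn):
    # to_csv(path) checks Path.is_dir() while resolving the output location;
    # forcing it to True lets a mock_open-based test run without touching
    # the real filesystem.
    with patch.object(Path, "is_dir", return_value=True):
        return fn()
```

The patch is scoped to the `with` block, so real `Path.is_dir` behavior is restored as soon as the test body returns.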
Codecov Report❌ Patch coverage is
Additional details and impacted files:
@@ Coverage Diff @@
## main #25 +/- ##
==========================================
- Coverage 81.21% 72.17% -9.04%
==========================================
Files 16 22 +6
Lines 958 1980 +1022
==========================================
+ Hits 778 1429 +651
- Misses 180 551 +371
PR summary:
- archive_loader service runs after the DB is healthy
- database gains a healthcheck; parser waits for service_healthy before starting
- init_archive_data.sh made idempotent (ON CONFLICT DO NOTHING, CREATE INDEX IF NOT EXISTS)
- generate_archive_data