This repository is a wrapper-only performance benchmark for FieldSpaceNN. It measures runtime and hardware behavior for short HEALPix training runs and intentionally does not evaluate model accuracy.
- Creates a local `venv`-based environment and installs FieldSpaceNN from GitHub.
- Generates deterministic synthetic HEALPix `tas` data in Zarr format under `data/`, defaulting to level 10.
- Launches FieldSpaceNN training through the upstream entrypoint with benchmark-owned Hydra configs.
- Collects iterations per second, end-to-end wall time, GPU memory, and GPU utilization.
- Writes CSV outputs and plots benchmark summaries with matplotlib.
This benchmark was designed against the current upstream FieldSpaceNN repository at commit d8eac2988e305f1fd5d6192940f08235bca0ec26.
Key findings from the inspection:
- The public training entrypoint is still:
  `python -m fieldspacenn.src.train -cp fieldspacenn/configs -cn mg_transformer_train`
- The upstream `train.py` instantiates `cfg.logger`, `cfg.model`, `cfg.trainer`, and `cfg.dataloader.datamodule` directly from Hydra, so callback injection and logger replacement work cleanly from this repo.
- The stock `mg_transformer_train` config is level-6 oriented: `model.in_zooms: [3, 5, 6]`, `mgrids.zoom_max: 6`.
- The stock HEALPix loader path uses `fieldspacenn.src.data.datasets_healpix.HealPixLoader`, which inherits file loading from `BaseDataset.get_files()`. That path uses `xr.open_dataset()` / `xr.load_dataset()` even for `.zarr` inputs.
- I explicitly tested `.zarr` loading with xarray 2023.4.2 and zarr 2.14.2 in this environment: `xr.open_dataset("...zarr")`, `xr.load_dataset("...zarr")`, and `xr.open_zarr("...zarr")` all worked.
Because of those findings, this repo keeps the benchmark logic outside FieldSpaceNN itself:
- benchmark-owned Hydra configs live in `benchmarking/configs/`
- a benchmark-side Lightning callback records iteration timing
- a benchmark-side logger disables snapshot/image logging
- the wrapper launches the upstream training module as a subprocess
No FieldSpaceNN source changes are required for the default benchmark path.
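The subprocess launch pattern can be sketched as below. This is a simplified illustration assuming a hypothetical `build_train_command` helper; the real wrapper lives in `benchmarking/run_benchmark.py`, and the override strings shown are examples, not the exact defaults:

```python
import subprocess
import sys

def build_train_command(config_dir: str, config_name: str, overrides: list) -> list:
    """Build the command line for the upstream FieldSpaceNN entrypoint.

    Hydra overrides (e.g. an injected callback or a replacement logger)
    are appended as plain ``key=value`` arguments.
    """
    cmd = [sys.executable, "-m", "fieldspacenn.src.train",
           "-cp", config_dir, "-cn", config_name]
    return cmd + list(overrides)

cmd = build_train_command(
    "benchmarking/configs",
    "mg_transformer_train",
    ["trainer.max_steps=8", "trainer.enable_checkpointing=false"],
)
# The wrapper would then launch it with something like:
# subprocess.run(cmd, check=True)
```

Keeping the launch as a subprocess (rather than importing upstream training code) is what lets the benchmark stay wrapper-only.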
- `pyproject.toml`: benchmark package metadata and benchmark-only Python dependencies
- `scripts/setup_env.sh`: local `venv` setup script
- `scripts/generate_synthetic_healpix_zarr.py`: CLI wrapper for synthetic data generation
- `scripts/run_level8_compare.slurm`: main SLURM workflow that compares `full_cell` vs `time_major` at complexity 1 and 4, then plots results
- `scripts/run_benchmark.slurm`: SLURM launcher for single-node or multi-node runs
- `scripts/plot_benchmark_results.py`: matplotlib plotting script
- `benchmarking/run_benchmark.py`: wrapper launcher around upstream FieldSpaceNN training
- `benchmarking/monitor.py`: GPU monitoring via `pynvml` with `nvidia-smi` fallback
- `benchmarking/callbacks.py`: benchmark timing callback injected through Hydra
- `benchmarking/local_logger.py`: no-op/local logger used to avoid Weights & Biases by default
- `benchmarking/configs/`: benchmark-owned Hydra config tree
- `data/`: synthetic benchmark data and normalization metadata
- `results/`: benchmark CSV outputs, logs, and plots
Run the setup script from this repository root:
```bash
bash scripts/setup_env.sh
```

The script:

- loads `git` and `python3` modules if available
- creates `.venv`
- upgrades `pip`, `setuptools`, and `wheel`
- installs this benchmark repo
- installs `torch==2.4.0`, `torchvision==0.19.0`, and `torchaudio==2.4.0` from the CUDA 12.1 PyTorch wheel index by default
- installs FieldSpaceNN from GitHub
Useful environment overrides:
- `PYTORCH_INDEX_URL`
- `FIELDSPACENN_REF`
- `VENV_DIR`
The main intended workflow is a SLURM submission that runs four benchmarks on the same level-8, 4096-timestep dataset shape:
- `full_cell`, complexity 1
- `full_cell`, complexity 4
- `time_major`, complexity 1
- `time_major`, complexity 4
After all four runs finish, the job generates plots from the benchmark CSV.
Submit it from the repository root with:
```bash
sbatch scripts/run_level8_compare.slurm
```

The script defaults to:

- `LEVEL=8`
- `TIMESTEPS=4096`
- chunk/complexity combinations: `full_cell,c1`, `full_cell,c4`, `time_major,c1`, `time_major,c4`
- plots written from `results/benchmark_runs.csv` into `results/plots/`
Useful overrides for this workflow:
- `RUN_PREFIX`: prefix for the four run ids, which become `<prefix>-full_cell-c1`, `<prefix>-full_cell-c4`, `<prefix>-time_major-c1`, and `<prefix>-time_major-c4`
- `MAX_STEPS`
- `BATCH_SIZE`
- `DATALOADER_WORKERS`
- `PRECISION`
- `PLOT_CSV_PATH`
- `PLOT_OUTPUT_DIR`
- `BENCHMARK_EXTRA_ARGS`
Example:
```bash
RUN_PREFIX=chunk-compare MAX_STEPS=5000 sbatch scripts/run_level8_compare.slurm
```

Generate the default small dataset:

```bash
source .venv/bin/activate
python scripts/generate_synthetic_healpix_zarr.py --chunk-label time_major
```

This writes:
- a Zarr store such as `data/tas_level10_t6_time_major.zarr`
- normalization metadata at `data/input/norm_dict.json`
The benchmark uses the requested top zoom plus three coarser zooms with step size 2.
Example: `--level 8` resolves to benchmark zooms 8, 6, 4, 2.
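The zoom ladder rule is easy to restate in code; the snippet below is a re-derivation of the documented behavior, not the generator's actual implementation:

```python
def benchmark_zooms(top_zoom: int, n_zooms: int = 4, step: int = 2) -> list:
    """Requested top zoom plus (n_zooms - 1) coarser zooms, `step` apart."""
    return [top_zoom - step * i for i in range(n_zooms)]

print(benchmark_zooms(8))   # [8, 6, 4, 2]
print(benchmark_zooms(10))  # [10, 8, 6, 4]
```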
Dataset characteristics:
- variable: `tas`
- dtype: `float32`
- dimensions: `time`, `cell`
- time coordinate: realistic Unix timestamps so upstream `TimeEmbedder` receives values in the same range as the stock config
- spatial coordinate: `cell` index only
The generator is deterministic and chunk-aware. Default chunk presets are:
- `time_major`: `(time=1, cell=262144)`
- `balanced`: `(time=2, cell=131072)`
- `spatial_heavy`: `(time=1, cell=1048576)`
- `full_cell`: `(time=1, cell=12582912)`
- `full_time`: `(time=all timesteps, cell=npix // 64)`
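For orientation, the preset sizes line up with HEALPix geometry: at level `L`, `nside = 2**L` and `npix = 12 * nside**2 = 12 * 4**L`, so a level-10 grid has 12,582,912 cells, which is exactly the `full_cell` chunk. A minimal sketch (the helper name is illustrative, not from the repo):

```python
def healpix_npix(level: int) -> int:
    """Total HEALPix cells at a level: nside = 2**level, npix = 12 * nside**2."""
    return 12 * 4 ** level

npix = healpix_npix(10)  # cells at the default level 10

# The presets above restated as (time, cell) chunk shapes. full_cell spans
# one whole timestep (cell chunk == npix); full_time's cell chunk is npix // 64.
presets = {
    "time_major":    (1, 262_144),
    "balanced":      (2, 131_072),
    "spatial_heavy": (1, 1_048_576),
    "full_cell":     (1, npix),
}
```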
Custom chunking is also supported:
```bash
python scripts/generate_synthetic_healpix_zarr.py \
  --chunk-label custom \
  --time-chunk 2 \
  --cell-chunk 524288 \
  --force
```

The wrapper defaults to a short performance-only run:
- level-10 HEALPix synthetic `tas`
- four-zoom benchmark model config derived from the requested top zoom
- no checkpointing
- no validation loop during the short run
- no external logging backend
Smoke test example:
```bash
python -m benchmarking.run_benchmark --smoke-test --accelerator cpu
```

GPU example:
```bash
python -m benchmarking.run_benchmark \
  --accelerator gpu \
  --level 8 \
  --model-complexity 2 \
  --batch-size 1 \
  --num-workers 8 \
  --chunk-label balanced \
  --max-steps 8
```

Passing raw Hydra overrides through the wrapper:
```bash
python -m benchmarking.run_benchmark \
  --chunk-label time_major \
  --override trainer.precision=bf16-mixed \
  --override model.lr_groups.default.lr=1e-4
```

Submit with default header values:

```bash
sbatch scripts/run_benchmark.slurm
```

Override scheduler resources at submission time:
```bash
sbatch \
  --account=<account> \
  --partition=<partition> \
  --time=01:00:00 \
  --nodes=2 \
  --gpus-per-node=4 \
  --cpus-per-task=16 \
  --mem=64G \
  scripts/run_benchmark.slurm
```

Runtime knobs are exposed through environment variables:
- `RUN_ID`
- `TRAINER_ACCELERATOR`
- `TRAINER_NUM_NODES`
- `GPUS_PER_NODE`
- `TRAINER_STRATEGY`
- `DATALOADER_WORKERS`
- `BATCH_SIZE`
- `CHUNK_LABEL`
- `LEVEL`
- `TIMESTEPS`
- `MODEL_COMPLEXITY`
- `MAX_STEPS`
- `SAMPLE_INTERVAL_S`
- `PRECISION`
- `BENCHMARK_EXTRA_ARGS`
Example multi-node submission:
```bash
RUN_ID=slurm-test \
TRAINER_NUM_NODES=2 \
GPUS_PER_NODE=4 \
TRAINER_STRATEGY=ddp \
CHUNK_LABEL=spatial_heavy \
MODEL_COMPLEXITY=2 \
MAX_STEPS=8 \
sbatch --nodes=2 --gpus-per-node=4 scripts/run_benchmark.slurm
```

Main artifacts:
- `results/benchmark_runs.csv`: one row per benchmark run
- `results/gpu_samples.csv`: optional time-series GPU samples merged across nodes
- `results/artifacts/<run_id>/train_rank<global_rank>_<host>.log`: upstream training stdout/stderr per rank under SLURM
- `results/artifacts/<run_id>/callback_metrics.json`: callback-side timing summary, including both batch-timed and effective throughput
- `results/artifacts/<run_id>/gpu_samples_<run_id>_<host>.csv`: host-local GPU samples
- `results/artifacts/<run_id>/monitor_<host>.json`: host-local GPU summary
- `results/logs/<run_id>/composed_config.yaml`: composed Hydra config from the benchmark logger
The run-level CSV stores:
- run metadata
- chunk configuration
- override summary
- start/end/wall time
- completed train steps
- iterations per second
- per-GPU peak memory and utilization as JSON-encoded maps
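The JSON-in-CSV pattern for the per-GPU maps can be illustrated with the standard library; the column names below are hypothetical, not the benchmark's exact schema:

```python
import csv
import io
import json

# One hypothetical run row. Per-GPU maps are serialized with json.dumps so
# each map fits in a single CSV cell; the csv module handles the quoting.
row = {
    "run_id": "demo-run",
    "iterations_per_second": 3.2,
    "gpu_peak_memory_mb": json.dumps({"0": 10240, "1": 10192}),
}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

# Reading it back: decode the JSON cell into a per-GPU map again.
parsed = next(csv.DictReader(io.StringIO(buf.getvalue())))
peak_memory = json.loads(parsed["gpu_peak_memory_mb"])
```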
Create plots from the run CSV:
```bash
python scripts/plot_benchmark_results.py --csv-path results/benchmark_runs.csv
```

This writes plots into `results/plots/`, including:
- batch-timed throughput by chunk configuration
- effective throughput by chunk configuration
- wall time by chunk configuration
- peak GPU memory by chunk configuration
- mean/max GPU utilization by chunk configuration
The two throughput metrics mean:
- batch-timed throughput: `num_train_steps_completed / timed_batch_seconds`; this excludes most dataloader and between-batch overhead
- effective throughput: `num_train_steps_completed / fit_wall_time_s`; this is closer to the training rate shown in Lightning progress output
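A quick worked example with made-up timing numbers shows how the two metrics diverge:

```python
# Hypothetical numbers for one run, not real benchmark output.
num_train_steps_completed = 100
timed_batch_seconds = 25.0  # sum of per-batch timings only
fit_wall_time_s = 40.0      # full fit() wall time, incl. dataloader overhead

batch_timed_throughput = num_train_steps_completed / timed_batch_seconds  # it/s
effective_throughput = num_train_steps_completed / fit_wall_time_s        # it/s

# Batch-timed throughput is always >= effective throughput, because the wall
# time includes everything the batch timer excludes.
assert batch_timed_throughput >= effective_throughput
```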
Typical extension points:
- change the HEALPix benchmark level with `--level` or `LEVEL`
- change model size with `--model-complexity` or `MODEL_COMPLEXITY`; the benchmark uses `att_dim = 512 * model_complexity`
- change chunk layouts with `--chunk-label` or custom chunk sizes
- change trainer settings with wrapper flags or `--override ...`
- change the benchmark model by editing `benchmarking/configs/model/mg_transformer.yaml`
- change data sampling windows in `benchmarking/configs/dataloader/zarr_healpix.yaml`
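The model-size rule is a one-liner; the helper name below is illustrative, not from the repo:

```python
def att_dim(model_complexity: int) -> int:
    """Attention width used by the benchmark model config."""
    return 512 * model_complexity

# complexity 1, 2, 4 -> widths 512, 1024, 2048
widths = [att_dim(c) for c in (1, 2, 4)]
```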
If you want to benchmark a different FieldSpaceNN model family, keep the same wrapper pattern:
- leave upstream code untouched if possible
- add a benchmark-owned Hydra config
- keep logging/checkpointing/validation overhead minimal
- keep result writing inside this repo
- This benchmark is for performance statistics only. It does not compute accuracy or scientific skill metrics.
- The default model config is intentionally small and short-running. It is meant to exercise the training stack, I/O path, and GPU behavior, not to produce useful climate predictions.
- The explicit `.zarr` compatibility test succeeded with the xarray/zarr versions available during development, so no benchmark-side loader shim was added. If your site installs a materially different xarray stack and `.zarr` loading breaks, the first place to re-check is FieldSpaceNN's `BaseDataset.get_files()` path.