FREVA-CLINT/FieldSpaceNN-bench
FieldSpaceNN Benchmark

This repository is a wrapper-only performance benchmark for FieldSpaceNN. It measures runtime and hardware behavior for short HEALPix training runs and intentionally does not evaluate model accuracy.

What This Repo Does

  • Creates a local venv-based environment and installs FieldSpaceNN from GitHub.
  • Generates deterministic synthetic HEALPix tas data in Zarr format under data/, defaulting to level 10.
  • Launches FieldSpaceNN training through the upstream entrypoint with benchmark-owned Hydra configs.
  • Collects iterations per second, end-to-end wall time, GPU memory, and GPU utilization.
  • Writes CSV outputs and plots benchmark summaries with matplotlib.

Upstream Inspection Notes

This benchmark was designed against the current upstream FieldSpaceNN repository at commit d8eac2988e305f1fd5d6192940f08235bca0ec26.

Key findings from the inspection:

  • The public training entrypoint is still: python -m fieldspacenn.src.train -cp fieldspacenn/configs -cn mg_transformer_train
  • The upstream train.py instantiates cfg.logger, cfg.model, cfg.trainer, and cfg.dataloader.datamodule directly from Hydra, so callback injection and logger replacement work cleanly from this repo.
  • The stock mg_transformer_train config is level-6 oriented: model.in_zooms: [3, 5, 6] and mgrids.zoom_max: 6
  • The stock healpix loader path uses fieldspacenn.src.data.datasets_healpix.HealPixLoader, which inherits file loading from BaseDataset.get_files(). That path uses xr.open_dataset() / xr.load_dataset() even for .zarr inputs.
  • I explicitly tested .zarr loading with xarray 2023.4.2 and zarr 2.14.2 in this environment: xr.open_dataset("...zarr"), xr.load_dataset("...zarr"), and xr.open_zarr("...zarr") all worked.

Because of those findings, this repo keeps the benchmark logic outside FieldSpaceNN itself:

  • benchmark-owned Hydra configs live in benchmarking/configs/
  • a benchmark-side Lightning callback records iteration timing
  • a benchmark-side logger disables snapshot/image logging
  • the wrapper launches the upstream training module as a subprocess

No FieldSpaceNN source changes are required for the default benchmark path.

Repository Layout

  • pyproject.toml: benchmark package metadata and benchmark-only Python dependencies
  • scripts/setup_env.sh: local venv setup script
  • scripts/generate_synthetic_healpix_zarr.py: CLI wrapper for synthetic data generation
  • scripts/run_level8_compare.slurm: main SLURM workflow that compares full_cell vs time_major at complexity 1 and 4, then plots results
  • scripts/run_benchmark.slurm: SLURM launcher for single-node or multi-node runs
  • scripts/plot_benchmark_results.py: matplotlib plotting script
  • benchmarking/run_benchmark.py: wrapper launcher around upstream FieldSpaceNN training
  • benchmarking/monitor.py: GPU monitoring via pynvml with nvidia-smi fallback
  • benchmarking/callbacks.py: benchmark timing callback injected through Hydra
  • benchmarking/local_logger.py: no-op/local logger used to avoid Weights & Biases by default
  • benchmarking/configs/: benchmark-owned Hydra config tree
  • data/: synthetic benchmark data and normalization metadata
  • results/: benchmark CSV outputs, logs, and plots
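The sampling approach behind benchmarking/monitor.py can be sketched as follows. This is an illustrative helper for the nvidia-smi fallback path, not the repo's actual code: the function names and the returned field names are assumptions, and it parses the CSV produced by `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`.

```python
import subprocess

QUERY_FIELDS = "index,utilization.gpu,memory.used"

def parse_smi_csv(text: str) -> dict[int, dict[str, float]]:
    """Parse nvidia-smi CSV output (noheader, nounits) into per-GPU samples.

    Returns a map of GPU index -> {"util_pct": ..., "mem_used_mib": ...}.
    """
    samples = {}
    for line in text.strip().splitlines():
        idx, util, mem = (field.strip() for field in line.split(","))
        samples[int(idx)] = {"util_pct": float(util), "mem_used_mib": float(mem)}
    return samples

def sample_gpus() -> dict[int, dict[str, float]]:
    """Take one monitoring sample via the nvidia-smi fallback path."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_smi_csv(out)
```

The real monitor prefers pynvml and only falls back to shelling out to nvidia-smi; periodic calls to a sampler like this are what end up in the host-local gpu_samples CSVs.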

Setup

Run the setup script from this repository root:

bash scripts/setup_env.sh

The script:

  • loads git and python3 modules if available
  • creates .venv
  • upgrades pip, setuptools, and wheel
  • installs this benchmark repo
  • installs torch==2.4.0, torchvision==0.19.0, and torchaudio==2.4.0 from the CUDA 12.1 PyTorch wheel index by default
  • installs FieldSpaceNN from GitHub

Useful environment overrides:

  • PYTORCH_INDEX_URL
  • FIELDSPACENN_REF
  • VENV_DIR

Main Workflow

The main intended workflow is a SLURM submission that runs four benchmarks on the same level-8, 4096-timestep dataset shape:

  • full_cell, complexity 1
  • full_cell, complexity 4
  • time_major, complexity 1
  • time_major, complexity 4

After all four runs finish, the job generates plots from the benchmark CSV.

Submit it from the repository root with:

sbatch scripts/run_level8_compare.slurm

The script defaults to:

  • LEVEL=8
  • TIMESTEPS=4096
  • chunk/complexity combinations: (full_cell, c1), (full_cell, c4), (time_major, c1), (time_major, c4)
  • plots written from results/benchmark_runs.csv into results/plots/

Useful overrides for this workflow:

  • RUN_PREFIX: prefix for the four run ids, which become <prefix>-full_cell-c1, <prefix>-full_cell-c4, <prefix>-time_major-c1, and <prefix>-time_major-c4
  • MAX_STEPS
  • BATCH_SIZE
  • DATALOADER_WORKERS
  • PRECISION
  • PLOT_CSV_PATH
  • PLOT_OUTPUT_DIR
  • BENCHMARK_EXTRA_ARGS

Example:

RUN_PREFIX=chunk-compare MAX_STEPS=5000 sbatch scripts/run_level8_compare.slurm
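The four run ids derived from RUN_PREFIX can be sketched in Python (the helper name is illustrative; the SLURM script builds the same <prefix>-<chunk>-c<complexity> strings in shell):

```python
from itertools import product

def run_ids(prefix: str) -> list[str]:
    """Build the four benchmark run ids: <prefix>-<chunk_label>-c<complexity>."""
    chunk_labels = ["full_cell", "time_major"]
    complexities = [1, 4]
    return [f"{prefix}-{chunk}-c{cplx}" for chunk, cplx in product(chunk_labels, complexities)]
```

With RUN_PREFIX=chunk-compare this yields chunk-compare-full_cell-c1, chunk-compare-full_cell-c4, chunk-compare-time_major-c1, and chunk-compare-time_major-c4.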

Synthetic Data Generation

Generate the default small dataset:

source .venv/bin/activate
python scripts/generate_synthetic_healpix_zarr.py --chunk-label time_major

This writes:

  • a Zarr store such as data/tas_level10_t6_time_major.zarr
  • normalization metadata at data/input/norm_dict.json

The benchmark uses the requested top zoom plus three coarser zooms with a step size of 2. Example: --level 8 resolves to benchmark zooms 8, 6, 4, 2.
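The zoom selection rule can be sketched as (a hypothetical helper; the actual logic lives in the wrapper):

```python
def benchmark_zooms(level: int, n_zooms: int = 4, step: int = 2) -> list[int]:
    """Resolve the requested top zoom plus coarser zooms, descending by a fixed step."""
    zooms = [level - step * i for i in range(n_zooms)]
    if zooms[-1] < 0:
        raise ValueError(f"level {level} is too small for {n_zooms} zooms with step {step}")
    return zooms
```

For example, benchmark_zooms(8) returns [8, 6, 4, 2], and the level-10 default returns [10, 8, 6, 4].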

Dataset characteristics:

  • variable: tas
  • dtype: float32
  • dimensions: time, cell
  • time coordinate: realistic Unix timestamps so upstream TimeEmbedder receives values in the same range as the stock config
  • spatial coordinate: cell index only

The generator is deterministic and chunk-aware. Default chunk presets are:

  • time_major: (time=1, cell=262144)
  • balanced: (time=2, cell=131072)
  • spatial_heavy: (time=1, cell=1048576)
  • full_cell: (time=1, cell=12582912)
  • full_time: (time=all timesteps, cell=npix // 64)
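For reference, the cell counts above follow from the HEALPix relation npix = 12 * 4**level, so level 10 has 12,582,912 cells and the full_cell preset covers one whole timestep. A sketch of per-chunk sizing for the float32 tas variable (uncompressed; the itemsize of 4 bytes is the float32 assumption):

```python
NPIX_LEVEL_10 = 12 * 4**10  # 12,582,912 cells at HEALPix level 10

# Default (time, cell) chunk presets at level 10; full_time depends on the run length.
PRESETS = {
    "time_major": (1, 262_144),
    "balanced": (2, 131_072),
    "spatial_heavy": (1, 1_048_576),
    "full_cell": (1, NPIX_LEVEL_10),
}

def chunk_bytes(time_chunk: int, cell_chunk: int, itemsize: int = 4) -> int:
    """Uncompressed bytes per chunk for a float32 variable."""
    return time_chunk * cell_chunk * itemsize

for label, (t, c) in PRESETS.items():
    print(f"{label:>13}: {chunk_bytes(t, c) / 2**20:.1f} MiB per chunk")
```

This puts time_major and balanced at 1 MiB per chunk, spatial_heavy at 4 MiB, and full_cell at 48 MiB.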

Custom chunking is also supported:

python scripts/generate_synthetic_healpix_zarr.py \
  --chunk-label custom \
  --time-chunk 2 \
  --cell-chunk 524288 \
  --force

Local Benchmark Runs

The wrapper defaults to a short performance-only run:

  • level-10 HEALPix synthetic tas
  • four-zoom benchmark model config derived from the requested top zoom
  • no checkpointing
  • no validation loop during the short run
  • no external logging backend

Smoke test example:

python -m benchmarking.run_benchmark --smoke-test --accelerator cpu

GPU example:

python -m benchmarking.run_benchmark \
  --accelerator gpu \
  --level 8 \
  --model-complexity 2 \
  --batch-size 1 \
  --num-workers 8 \
  --chunk-label balanced \
  --max-steps 8

Passing raw Hydra overrides through the wrapper:

python -m benchmarking.run_benchmark \
  --chunk-label time_major \
  --override trainer.precision=bf16-mixed \
  --override model.lr_groups.default.lr=1e-4

SLURM Runs

Submit with default header values:

sbatch scripts/run_benchmark.slurm

Override scheduler resources at submission time:

sbatch \
  --account=<account> \
  --partition=<partition> \
  --time=01:00:00 \
  --nodes=2 \
  --gpus-per-node=4 \
  --cpus-per-task=16 \
  --mem=64G \
  scripts/run_benchmark.slurm

Runtime knobs are exposed through environment variables:

  • RUN_ID
  • TRAINER_ACCELERATOR
  • TRAINER_NUM_NODES
  • GPUS_PER_NODE
  • TRAINER_STRATEGY
  • DATALOADER_WORKERS
  • BATCH_SIZE
  • CHUNK_LABEL
  • LEVEL
  • TIMESTEPS
  • MODEL_COMPLEXITY
  • MAX_STEPS
  • SAMPLE_INTERVAL_S
  • PRECISION
  • BENCHMARK_EXTRA_ARGS

Example multi-node submission:

RUN_ID=slurm-test \
TRAINER_NUM_NODES=2 \
GPUS_PER_NODE=4 \
TRAINER_STRATEGY=ddp \
CHUNK_LABEL=spatial_heavy \
MODEL_COMPLEXITY=2 \
MAX_STEPS=8 \
sbatch --nodes=2 --gpus-per-node=4 scripts/run_benchmark.slurm

Outputs

Main artifacts:

  • results/benchmark_runs.csv: one row per benchmark run
  • results/gpu_samples.csv: optional time-series GPU samples merged across nodes
  • results/artifacts/<run_id>/train_rank<global_rank>_<host>.log: upstream training stdout/stderr per rank under SLURM
  • results/artifacts/<run_id>/callback_metrics.json: callback-side timing summary, including both batch-timed and effective throughput
  • results/artifacts/<run_id>/gpu_samples_<run_id>_<host>.csv: host-local GPU samples
  • results/artifacts/<run_id>/monitor_<host>.json: host-local GPU summary
  • results/logs/<run_id>/composed_config.yaml: composed Hydra config from the benchmark logger

The run-level CSV stores:

  • run metadata
  • chunk configuration
  • override summary
  • start/end/wall time
  • completed train steps
  • iterations per second
  • per-GPU peak memory and utilization as JSON-encoded maps
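Because the per-GPU columns are JSON-encoded maps, downstream analysis has to decode them before use. A minimal sketch, assuming illustrative column names (check the header of results/benchmark_runs.csv for the actual ones):

```python
import csv
import io
import json

def read_runs(csv_text: str, json_cols=("gpu_peak_memory_mib", "gpu_utilization_pct")):
    """Read benchmark rows, decoding JSON-encoded per-GPU map columns.

    The names in json_cols are illustrative assumptions, not the repo's
    guaranteed schema.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for col in json_cols:
            if row.get(col):
                row[col] = json.loads(row[col])
        rows.append(row)
    return rows
```

The same pattern applies when loading the CSV with pandas: apply json.loads to each map-valued column before comparing runs.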

Plotting

Create plots from the run CSV:

python scripts/plot_benchmark_results.py --csv-path results/benchmark_runs.csv

This writes plots into results/plots/, including:

  • batch-timed throughput by chunk configuration
  • effective throughput by chunk configuration
  • wall time by chunk configuration
  • peak GPU memory by chunk configuration
  • mean/max GPU utilization by chunk configuration

The two throughput metrics mean:

  • batch-timed throughput: num_train_steps_completed / timed_batch_seconds; this excludes most dataloader and between-batch overhead
  • effective throughput: num_train_steps_completed / fit_wall_time_s; this is closer to the training rate shown in Lightning progress output
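Concretely, using the field names described above, the two metrics reduce to:

```python
def throughputs(num_train_steps_completed: int,
                timed_batch_seconds: float,
                fit_wall_time_s: float) -> dict[str, float]:
    """Compute the two throughput metrics from the callback timing fields."""
    return {
        # Excludes most dataloader and between-batch overhead.
        "batch_timed_its_per_s": num_train_steps_completed / timed_batch_seconds,
        # Closer to the training rate shown in Lightning progress output.
        "effective_its_per_s": num_train_steps_completed / fit_wall_time_s,
    }
```

The gap between the two is a rough indicator of non-compute overhead: when effective throughput is far below batch-timed throughput, time is going into data loading or between-batch work rather than the training step itself.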

Extending the Benchmark

Typical extension points:

  • change the HEALPix benchmark level with --level or LEVEL
  • change model size with --model-complexity or MODEL_COMPLEXITY; the benchmark uses att_dim = 512 * model_complexity
  • change chunk layouts with --chunk-label or custom chunk sizes
  • change trainer settings with wrapper flags or --override ...
  • change the benchmark model by editing benchmarking/configs/model/mg_transformer.yaml
  • change data sampling windows in benchmarking/configs/dataloader/zarr_healpix.yaml

If you want to benchmark a different FieldSpaceNN model family, keep the same wrapper pattern:

  • leave upstream code untouched if possible
  • add a benchmark-owned Hydra config
  • keep logging/checkpointing/validation overhead minimal
  • keep result writing inside this repo

Caveats

  • This benchmark is for performance statistics only. It does not compute accuracy or scientific skill metrics.
  • The default model config is intentionally small and short-running. It is meant to exercise the training stack, I/O path, and GPU behavior, not to produce useful climate predictions.
  • The explicit .zarr compatibility test succeeded with the xarray/zarr versions available during development, so no benchmark-side loader shim was added. If your site installs a materially different xarray stack and .zarr loading breaks, the first place to re-check is FieldSpaceNN’s BaseDataset.get_files() path.
