This repository is a wrapper-only performance benchmark for FieldSpaceNN. It measures runtime and hardware behavior for short HEALPix training runs and intentionally does not evaluate model accuracy.
- Creates a local `venv`-based environment and installs FieldSpaceNN from GitHub.
- Generates deterministic synthetic HEALPix `tas` data in Zarr format under `data/`, defaulting to level 10.
- Launches FieldSpaceNN training through the upstream entrypoint with benchmark-owned Hydra configs.
- Collects iterations per second, end-to-end wall time, GPU memory, and GPU utilization.
- Writes CSV outputs and plots benchmark summaries with matplotlib.
This benchmark was designed against the current upstream FieldSpaceNN repository at commit d8eac2988e305f1fd5d6192940f08235bca0ec26.
Key findings from the inspection:
- The public training entrypoint is still:
  `python -m fieldspacenn.src.train -cp fieldspacenn/configs -cn mg_transformer_train`
- The upstream `train.py` instantiates `cfg.logger`, `cfg.model`, `cfg.trainer`, and `cfg.dataloader.datamodule` directly from Hydra, so callback injection and logger replacement work cleanly from this repo.
- The stock `mg_transformer_train` config is level-6 oriented: `model.in_zooms: [3, 5, 6]`, `mgrids.zoom_max: 6`.
- The stock HEALPix loader path uses `fieldspacenn.src.data.datasets_healpix.HealPixLoader`, which inherits file loading from `BaseDataset.get_files()`. That path uses `xr.open_dataset()` / `xr.load_dataset()` even for `.zarr` inputs.
- I explicitly tested `.zarr` loading with xarray 2023.4.2 and zarr 2.14.2 in this environment: `xr.open_dataset("...zarr")`, `xr.load_dataset("...zarr")`, and `xr.open_zarr("...zarr")` all worked.
Because of those findings, this repo keeps the benchmark logic outside FieldSpaceNN itself:
- benchmark-owned Hydra configs live in `benchmarking/configs/`
- a benchmark-side Lightning callback records iteration timing
- a benchmark-side logger disables snapshot/image logging
- the wrapper launches the upstream training module as a subprocess
No FieldSpaceNN source changes are required for the default benchmark path.
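The subprocess launch pattern can be sketched as below. This is a simplified illustration assuming a hypothetical `build_train_command` helper; the real wrapper lives in `benchmarking/run_benchmark.py`, and the override strings shown are examples, not the exact defaults:

```python
import subprocess
import sys

def build_train_command(config_dir: str, config_name: str, overrides: list) -> list:
    """Build the command line for the upstream FieldSpaceNN entrypoint.

    Hydra overrides (e.g. an injected callback or a replacement logger)
    are appended as plain ``key=value`` arguments.
    """
    cmd = [sys.executable, "-m", "fieldspacenn.src.train",
           "-cp", config_dir, "-cn", config_name]
    return cmd + list(overrides)

cmd = build_train_command(
    "benchmarking/configs",
    "mg_transformer_train",
    ["trainer.max_steps=8", "trainer.enable_checkpointing=false"],
)
# The wrapper would then launch it with something like:
# subprocess.run(cmd, check=True)
```

Keeping the launch as a subprocess (rather than importing upstream training code) is what lets the benchmark stay wrapper-only.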
- `pyproject.toml`: benchmark package metadata and benchmark-only Python dependencies
- `scripts/setup_env.sh`: local `venv` setup script
- `scripts/generate_synthetic_healpix_zarr.py`: CLI wrapper for synthetic data generation
- `scripts/run_level8_compare.slurm`: main SLURM workflow that compares `full_cell` vs `time_major` at complexity 1 and 4, then plots results
- `scripts/run_benchmark.slurm`: SLURM launcher for single-node or multi-node runs
- `scripts/plot_benchmark_results.py`: matplotlib plotting script
- `benchmarking/run_benchmark.py`: wrapper launcher around upstream FieldSpaceNN training
- `benchmarking/monitor.py`: GPU monitoring via `pynvml` with `nvidia-smi` fallback
- `benchmarking/callbacks.py`: benchmark timing callback injected through Hydra
- `benchmarking/local_logger.py`: no-op/local logger used to avoid Weights & Biases by default
- `benchmarking/configs/`: benchmark-owned Hydra config tree
- `data/`: synthetic benchmark data and normalization metadata
- `results/`: benchmark CSV outputs, logs, and plots
Run the setup script from this repository root:
```bash
bash scripts/setup_env.sh
```

The script:

- loads `git` and `python3` modules if available
- creates `.venv`
- upgrades `pip`, `setuptools`, and `wheel`
- installs this benchmark repo
- installs `torch==2.4.0`, `torchvision==0.19.0`, and `torchaudio==2.4.0` from the CUDA 12.1 PyTorch wheel index by default
- installs FieldSpaceNN from GitHub
Useful environment overrides:
- `PYTORCH_INDEX_URL`
- `FIELDSPACENN_REF`
- `VENV_DIR`
The main intended workflow is a SLURM submission that runs four benchmarks on the same level-8, 4096-timestep dataset shape:
- `full_cell`, complexity 1
- `full_cell`, complexity 4
- `time_major`, complexity 1
- `time_major`, complexity 4
After all four runs finish, the job generates plots from the benchmark CSV.
Submit it from the repository root with:
```bash
sbatch scripts/run_level8_compare.slurm
```

The script defaults to:

- `LEVEL=8`
- `TIMESTEPS=4096`
- chunk/complexity combinations: `full_cell,c1`, `full_cell,c4`, `time_major,c1`, `time_major,c4`
- plots written from `results/benchmark_runs.csv` into `results/plots/`
Useful overrides for this workflow:
- `RUN_PREFIX`: prefix for the four run ids, which become `<prefix>-full_cell-c1`, `<prefix>-full_cell-c4`, `<prefix>-time_major-c1`, and `<prefix>-time_major-c4`
- `MAX_STEPS`
- `BATCH_SIZE`
- `DATALOADER_WORKERS`
- `PRECISION`
- `PLOT_CSV_PATH`
- `PLOT_OUTPUT_DIR`
- `BENCHMARK_EXTRA_ARGS`
Example:
```bash
RUN_PREFIX=chunk-compare MAX_STEPS=5000 sbatch scripts/run_level8_compare.slurm
```

Generate the default small dataset:

```bash
source .venv/bin/activate
python scripts/generate_synthetic_healpix_zarr.py --chunk-label time_major
```

This writes:
- a Zarr store such as `data/tas_level10_t6_time_major.zarr`
- normalization metadata at `data/input/norm_dict.json`
The benchmark uses the requested top zoom plus three coarser zooms with step size 2.
Example: `--level 8` resolves to benchmark zooms 8, 6, 4, 2.
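The zoom ladder rule is easy to restate in code; the snippet below is a re-derivation of the documented behavior, not the generator's actual implementation:

```python
def benchmark_zooms(top_zoom: int, n_zooms: int = 4, step: int = 2) -> list:
    """Requested top zoom plus (n_zooms - 1) coarser zooms, `step` apart."""
    return [top_zoom - step * i for i in range(n_zooms)]

print(benchmark_zooms(8))   # [8, 6, 4, 2]
print(benchmark_zooms(10))  # [10, 8, 6, 4]
```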
Dataset characteristics:
- variable: `tas`
- dtype: `float32`
- dimensions: `time`, `cell`
- time coordinate: realistic Unix timestamps so upstream `TimeEmbedder` receives values in the same range as the stock config
- spatial coordinate: `cell` index only
The generator is deterministic and chunk-aware. Default chunk presets are:
- `time_major`: `(time=1, cell=262144)`
- `balanced`: `(time=2, cell=131072)`
- `spatial_heavy`: `(time=1, cell=1048576)`
- `full_cell`: `(time=1, cell=12582912)`
- `full_time`: `(time=all timesteps, cell=npix // 64)`
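For orientation, the preset sizes line up with HEALPix geometry: at level `L`, `nside = 2**L` and `npix = 12 * nside**2 = 12 * 4**L`, so a level-10 grid has 12,582,912 cells, which is exactly the `full_cell` chunk. A minimal sketch (the helper name is illustrative, not from the repo):

```python
def healpix_npix(level: int) -> int:
    """Total HEALPix cells at a level: nside = 2**level, npix = 12 * nside**2."""
    return 12 * 4 ** level

npix = healpix_npix(10)  # cells at the default level 10

# The presets above restated as (time, cell) chunk shapes. full_cell spans
# one whole timestep (cell chunk == npix); full_time's cell chunk is npix // 64.
presets = {
    "time_major":    (1, 262_144),
    "balanced":      (2, 131_072),
    "spatial_heavy": (1, 1_048_576),
    "full_cell":     (1, npix),
}
```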
Custom chunking is also supported:
```bash
python scripts/generate_synthetic_healpix_zarr.py \
  --chunk-label custom \
  --time-chunk 2 \
  --cell-chunk 524288 \
  --force
```

The wrapper defaults to a short performance-only run:
- level-10 HEALPix synthetic `tas`
- four-zoom benchmark model config derived from the requested top zoom
- no checkpointing
- no validation loop during the short run
- no external logging backend
Smoke test example:
```bash
python -m benchmarking.run_benchmark --smoke-test --accelerator cpu
```

GPU example:
```bash
python -m benchmarking.run_benchmark \
  --accelerator gpu \
  --level 8 \
  --model-complexity 2 \
  --batch-size 1 \
  --num-workers 8 \
  --chunk-label balanced \
  --max-steps 8
```

Passing raw Hydra overrides through the wrapper:
```bash
python -m benchmarking.run_benchmark \
  --chunk-label time_major \
  --override trainer.precision=bf16-mixed \
  --override model.lr_groups.default.lr=1e-4
```

Submit with default header values:

```bash
sbatch scripts/run_benchmark.slurm
```

Override scheduler resources at submission time:
```bash
sbatch \
  --account=<account> \
  --partition=<partition> \
  --time=01:00:00 \
  --nodes=2 \
  --gpus-per-node=4 \
  --cpus-per-task=16 \
  --mem=64G \
  scripts/run_benchmark.slurm
```

Runtime knobs are exposed through environment variables:
- `RUN_ID`
- `TRAINER_ACCELERATOR`
- `TRAINER_NUM_NODES`
- `GPUS_PER_NODE`
- `TRAINER_STRATEGY`
- `DATALOADER_WORKERS`
- `BATCH_SIZE`
- `CHUNK_LABEL`
- `LEVEL`
- `TIMESTEPS`
- `MODEL_COMPLEXITY`
- `MAX_STEPS`
- `SAMPLE_INTERVAL_S`
- `PRECISION`
- `BENCHMARK_EXTRA_ARGS`
Example multi-node submission:
```bash
RUN_ID=slurm-test \
TRAINER_NUM_NODES=2 \
GPUS_PER_NODE=4 \
TRAINER_STRATEGY=ddp \
CHUNK_LABEL=spatial_heavy \
MODEL_COMPLEXITY=2 \
MAX_STEPS=8 \
sbatch --nodes=2 --gpus-per-node=4 scripts/run_benchmark.slurm
```

Main artifacts:
- `results/benchmark_runs.csv`: one row per benchmark run
- `results/gpu_samples.csv`: optional time-series GPU samples merged across nodes
- `results/artifacts/<run_id>/train_rank<global_rank>_<host>.log`: upstream training stdout/stderr per rank under SLURM
- `results/artifacts/<run_id>/callback_metrics.json`: callback-side timing summary, including both batch-timed and effective throughput
- `results/artifacts/<run_id>/gpu_samples_<run_id>_<host>.csv`: host-local GPU samples
- `results/artifacts/<run_id>/monitor_<host>.json`: host-local GPU summary
- `results/logs/<run_id>/composed_config.yaml`: composed Hydra config from the benchmark logger
The run-level CSV stores:
- run metadata
- chunk configuration
- override summary
- start/end/wall time
- completed train steps
- iterations per second
- per-GPU peak memory and utilization as JSON-encoded maps
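The JSON-in-CSV pattern for the per-GPU maps can be illustrated with the standard library; the column names below are hypothetical, not the benchmark's exact schema:

```python
import csv
import io
import json

# One hypothetical run row. Per-GPU maps are serialized with json.dumps so
# each map fits in a single CSV cell; the csv module handles the quoting.
row = {
    "run_id": "demo-run",
    "iterations_per_second": 3.2,
    "gpu_peak_memory_mb": json.dumps({"0": 10240, "1": 10192}),
}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

# Reading it back: decode the JSON cell into a per-GPU map again.
parsed = next(csv.DictReader(io.StringIO(buf.getvalue())))
peak_memory = json.loads(parsed["gpu_peak_memory_mb"])
```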
Create plots from the run CSV:
```bash
python scripts/plot_benchmark_results.py --csv-path results/benchmark_runs.csv
```

This writes plots into `results/plots/`, including:
- batch-timed throughput by chunk configuration
- effective throughput by chunk configuration
- wall time by chunk configuration
- peak GPU memory by chunk configuration
- mean/max GPU utilization by chunk configuration
The two throughput metrics mean:
- batch-timed throughput: `num_train_steps_completed / timed_batch_seconds`; this excludes most dataloader and between-batch overhead
- effective throughput: `num_train_steps_completed / fit_wall_time_s`; this is closer to the training rate shown in Lightning progress output
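A quick worked example with made-up timing numbers shows how the two metrics diverge:

```python
# Hypothetical numbers for one run, not real benchmark output.
num_train_steps_completed = 100
timed_batch_seconds = 25.0  # sum of per-batch timings only
fit_wall_time_s = 40.0      # full fit() wall time, incl. dataloader overhead

batch_timed_throughput = num_train_steps_completed / timed_batch_seconds  # it/s
effective_throughput = num_train_steps_completed / fit_wall_time_s        # it/s

# Batch-timed throughput is always >= effective throughput, because the wall
# time includes everything the batch timer excludes.
assert batch_timed_throughput >= effective_throughput
```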
Typical extension points:
- change the HEALPix benchmark level with `--level` or `LEVEL`
- change model size with `--model-complexity` or `MODEL_COMPLEXITY`; the benchmark uses `att_dim = 512 * model_complexity`
- change chunk layouts with `--chunk-label` or custom chunk sizes
- change trainer settings with wrapper flags or `--override ...`
- change the benchmark model by editing `benchmarking/configs/model/mg_transformer.yaml`
- change data sampling windows in `benchmarking/configs/dataloader/zarr_healpix.yaml`
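The model-size rule is a one-liner; the helper name below is illustrative, not from the repo:

```python
def att_dim(model_complexity: int) -> int:
    """Attention width used by the benchmark model config."""
    return 512 * model_complexity

# complexity 1, 2, 4 -> widths 512, 1024, 2048
widths = [att_dim(c) for c in (1, 2, 4)]
```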
If you want to benchmark a different FieldSpaceNN model family, keep the same wrapper pattern:
- leave upstream code untouched if possible
- add a benchmark-owned Hydra config
- keep logging/checkpointing/validation overhead minimal
- keep result writing inside this repo
- This benchmark is for performance statistics only. It does not compute accuracy or scientific skill metrics.
- The default model config is intentionally small and short-running. It is meant to exercise the training stack, I/O path, and GPU behavior, not to produce useful climate predictions.
- The explicit `.zarr` compatibility test succeeded with the xarray/zarr versions available during development, so no benchmark-side loader shim was added. If your site installs a materially different xarray stack and `.zarr` loading breaks, the first place to re-check is FieldSpaceNN's `BaseDataset.get_files()` path.