diff --git a/0-how-to-use-guide/README.md b/0-how-to-use-guide/README.md new file mode 100644 index 0000000..87c1cc6 --- /dev/null +++ b/0-how-to-use-guide/README.md @@ -0,0 +1,67 @@ +# 0. How to use this guide + +## Goal + +Start with a clear plan for running this guide on LUMI, including prerequisites, expected outcomes, and recommended lesson order. + +## Assumptions + +- You are new to this guide and want a reliable first path through it. +- You can access LUMI and have a project account with GPU allocation. +- You are comfortable with basic Linux shell commands. + +## Who this guide is for + +This guide is intended for users who want to move model training from smaller environments to LUMI with practical, runnable examples. + +## Prerequisites + +Before starting lesson 1, ensure you have: + +- a LUMI user account +- a project with GPU hours +- basic familiarity with Python and Slurm +- a working clone of this repository in `/project` or `/scratch` + +## Recommended learning path + +Follow this order for the core workflow (`1` to `6`): + +1. [1. QuickStart](../1-quickstart/README.md) +2. [2. Setting up your own environment](../2-setting-up-environment/README.md) +3. [3. File formats for training data](../3-file-formats/README.md) +4. [4. Data Storage Options](../4-data-storage/README.md) +5. [5. Multi-GPU and Multi-Node Training](../5-multi-gpu-and-node/README.md) +6. [6. Monitoring and Profiling jobs](../6-monitoring-and-profiling/README.md) + +After lesson `6`, the core path is complete. Continue with optional experiment-tracking modules as needed: + +- [7. TensorBoard visualization](../7-tensorboard-visualization/README.md) +- [8. MLflow visualization](../8-mlflow-visualization/README.md) +- [9. W&B visualization](../9-wandb-visualization/README.md) + +## Baseline conventions used across lessons + +- Keep a single stable container path in `env.sh`. +- Keep datasets in a consistent location under `resources/` unless a chapter explicitly changes it. 
+- Run batch jobs from the chapter directory so relative paths in scripts resolve correctly. + +## What success looks like + +By the end of the core workflow, you should be able to: + +- run a single-GPU training job +- scale to multi-GPU and multi-node runs +- choose storage and data formats intentionally +- profile runs and interpret bottlenecks + +## Troubleshooting + +- Job submission fails immediately: verify `--account` and project permissions. +- Container-related errors: confirm `env.sh` points to an existing `.sif`. +- Missing files at runtime: run commands from the intended chapter directory. + +## Navigation + +- Previous: [Guide Home](../README.md) +- Next: [1. QuickStart](../1-quickstart/README.md) diff --git a/1-quickstart/README.md b/1-quickstart/README.md index a012527..47e6558 100644 --- a/1-quickstart/README.md +++ b/1-quickstart/README.md @@ -1,84 +1,154 @@ # 1. QuickStart -This chapter covers how to set up the environment to run the [`visiontransformer.py`](visiontransformer.py) script on LUMI. +This lesson gives you a minimal, end-to-end first run of [`visiontransformer.py`](visiontransformer.py) on LUMI. -First, you clone this repository to LUMI via the following command: +## Goal + +Run one single-GPU training job on LUMI and confirm that: + +- the container works on GPU +- the squashfs extension is available +- training logs and a model checkpoint are produced + +## Assumptions + +- You have a LUMI project account and can submit jobs to `small-g`. +- The repository is cloned on `/project` or `/scratch` (not `$HOME`). +- `../env.sh` is configured for your environment and points to a valid container via `CONTAINER`. + +## Working directory + +Run commands in this chapter from: + +```bash +cd /path/to/LUMI-AI-Guide/1-quickstart +``` + +## Minimal run checkpoint + +Command: + +```bash +sbatch run_base.sh +``` + +Success signal: + +- Job output contains `SMOKE TEST PASSED`. 
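The success signal above can also be checked from the shell. This is a minimal sketch, assuming the job writes to Slurm's default output file (pattern `slurm-%j.out`, where `%j` is the job ID) in the submit directory. The helper name `check_smoke_test` is hypothetical and not part of the repository.

```bash
#!/bin/bash
# Hypothetical helper (not part of the repository): succeed only if the
# given job output file contains the smoke-test marker.
check_smoke_test() {
    local outfile="$1"
    if grep -q "SMOKE TEST PASSED" "$outfile"; then
        echo "smoke test OK: $outfile"
    else
        echo "marker not found in: $outfile" >&2
        return 1
    fi
}

# Typical use on LUMI (commented out so the snippet runs anywhere):
#   jobid=$(sbatch --parsable run_base.sh)   # --parsable prints the job ID
#   # ...wait for the job to finish, then:
#   check_smoke_test "slurm-${jobid}.out"
```

The function returns a non-zero status when the marker is missing, so it can gate later steps in a wrapper script.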
+ +Clone this repository to LUMI if needed: ```bash git clone https://github.com/Lumi-supercomputer/LUMI-AI-Guide.git ``` -We recommend using your `/project/` or `/scratch/` directory of your project to clone the repository as your home directory (`$HOME`) has a capacity of 20 GB and is intended to store user configuration files and personal data. +Use `/project` or `/scratch` for this clone. `$HOME` is limited to 20 GB and is meant for configuration and personal files. -Next, navigate to the `LUMI-AI-Guide/quickstart` directory: +Then move to the lesson directory: ```bash cd LUMI-AI-Guide/1-quickstart ``` -We now need to setup the environment if we wish to run the included python scripts. We will use one of the provided PyTorch containers that we extend with additional packages (this step will be explained in more detail in the next chapter [Setting up your own environment](../2-setting-up-environment/README.md)). The fastest way to achieve this is to use the provided script `set_up_environment.sh`: +The recommended quickstart flow has three steps: + +1. Smoke-test the base container. +2. Build the squashfs extension. +3. Run the Vision Transformer training script with that extension. + +This keeps the runtime model consistent with the rest of the guide. + +## Step 1: Smoke test the base container + +Submit the base-container smoke test: ```bash -./set_up_environment.sh +sbatch run_base.sh ``` -If you receive a permission denied error, you can make the script executable by running: +Check the job output in: ```bash -chmod +x set_up_environment.sh +slurm-<jobid>.out ``` -After the script has finished, you will see now two new files in the `LUMI-AI-Guide/resources/` directory. One is the training dataset in a `hdf5` file format (`train_images.hdf5`). The other one is the virtual python enviroment in a `sqfs` file format (`ai-guide-env.sqsh`). +The output file is written to the directory where you run `sbatch` (here: `1-quickstart/`). 
-For this example, we use the [Tiny ImageNet Dataset](https://paperswithcode.com/dataset/tiny-imagenet) which is already transformed into the file system friendly hdf5 format (Chapter [File formats for training data](../3-file-formats/README.md) explains in detail why this step is necessary). Please have a look at the terms of access for the ImageNet Dataset [here](https://www.image-net.org/download.php). +You should see Python/Torch/ROCm version info and `SMOKE TEST PASSED`. -To run the Vision Transformer example, we need to use a batch job script. We provide a batch job script [`run.sh`](run.sh) that you can use to run the [`visiontransformer.py`](visiontransformer.py) script on a single GPU on a LUMI-G node. -A quickstart to SLURM is provided in the [LUMI documentation](https://docs.lumi-supercomputer.eu/runjobs/scheduled-jobs/slurm-quickstart/). +## Step 2: Build the squashfs extension + +Build `visiontransformer-env.sqsh` from the base container: + +```bash +./build_visiontransformer_sqsh.sh +``` + +- Output: `../resources/visiontransformer-env.sqsh` + +The extension includes packages needed by the sample scripts, such as `h5py`. -To run the provided script yourself, you need to replace the `--account` flag in line 2 of the [`run.sh`](run.sh) script with your own project account. You can find your project account by running the command `lumi-workspaces`. +## Step 3: Run Vision Transformer -After you have replaced the `--account` flag, you can submit the job to the LUMI scheduler by running: +Submit: ```bash -sbatch run.sh +sbatch run_vit.sh ``` -Once the job starts running, a `slurm-<jobid>.out` file will be created in the `quickstart` directory. This file contains the output of the job and will be updated as the job progresses. 
+`run_vit.sh` uses: + +- `../env.sh` for container selection +- `../resources/visiontransformer-env.sqsh` for the extended Python environment + +For quickstart, `run_vit.sh` uses `torchvision.datasets.FakeData` so the run is independent of dataset preparation. +No dataset files are required for this lesson. +The reported metrics are smoke-test indicators and are not meaningful model-quality numbers. -*Note that we do a `module purge` at the beginning of the script. This will cause some warnings that some modules were not unloaded or could not be reloaded. -It is safe to ignore these warnings at this point.* +To run the Vision Transformer example, we use the batch job script [`run_vit.sh`](run_vit.sh), which runs [`visiontransformer.py`](visiontransformer.py) on a single GPU on a LUMI-G node. +A quickstart to SLURM is provided in the [LUMI documentation](https://docs.lumi-supercomputer.eu/runjobs/scheduled-jobs/slurm-quickstart/). + +If needed, replace the `--account` value in [`run_base.sh`](run_base.sh) and [`run_vit.sh`](run_vit.sh) with your own project account. You can find your project account by running `lumi-workspaces`. + +Once the job starts, output is written to: + +```bash +slurm-<jobid>.out +``` +As in Step 1, this is written in the submit directory (`1-quickstart/`) because the script uses Slurm's default output behavior. -The output will show Training Loss and Validation Accuracy values for each epoch, similar to the following: +The output will show smoke-test metrics for each epoch, similar to the following: ```bash -Training for 4 epochs in total and then saving trained model. +Quickstart mode: using FakeData with random labels. +Metrics are only for smoke testing runtime, not model quality. Starting epoch 1. -Epoch 1, Training Loss: 4.825212168216705 -Validation Accuracy: 8.95% +Epoch 1, Smoke Loss: 5.82 +Smoke Accuracy (chance~0.5%): 0.49% Starting epoch 2. -Epoch 2, Training Loss: 4.165826177024841 -Validation Accuracy: 14.41% -Starting epoch 3. 
-Epoch 3, Training Loss: 3.792399849319458 -Validation Accuracy: 18.18% -Starting epoch 4. -Epoch 4, Training Loss: 3.558482360649109 -Validation Accuracy: 20.945% -Saving model to vit_b_16_imagenet.pth +Epoch 2, Smoke Loss: 5.46 +Smoke Accuracy (chance~0.5%): 0.24% +... ``` -Congratulations! You have run your first training job on LUMI. The next chapter [Setting up your own environment](../2-setting-up-environment/README.md) will explain in more detail how the environment was set up and how you can set up your own environment for your AI projects on LUMI. +## Verify + +After the three steps, confirm all of the following: + +- Base smoke-test output includes `SMOKE TEST PASSED`. +- `../resources/visiontransformer-env.sqsh` exists. +- Training job output includes epoch-level loss and accuracy logs. +- `vit_b_16_imagenet.pth` is created in `1-quickstart/`. + +## Troubleshooting + +- Job fails at submission: update `--account` in [`run_base.sh`](run_base.sh) and [`run_vit.sh`](run_vit.sh), then check with `lumi-workspaces`. +- Container variable error (`Set CONTAINER in env.sh`): set `CONTAINER` in `../env.sh` to a valid `.sif`. +- Quickstart data: this lesson uses `FakeData`; no dataset files are required. +- No GPU visible in smoke test: ensure `module load singularity-AI-bindings` is present and rerun. - ### Table of contents +## Navigation -- [Home](..#readme) -- [1. QuickStart](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/1-quickstart#readme) -- [2. Setting up your own environment](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/2-setting-up-environment#readme) -- [3. File formats for training data](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/3-file-formats#readme) -- [4. Data Storage Options](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/4-data-storage#readme) -- [5. Multi-GPU and Multi-Node Training](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/5-multi-gpu-and-node#readme) -- [6. 
Monitoring and Profiling jobs](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/6-monitoring-and-profiling#readme) -- [7. TensorBoard visualization](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/7-TensorBoard-visualization#readme) -- [8. MLflow visualization](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/8-MLflow-visualization#readme) -- [9. Wandb visualization](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/9-Wandb-visualization#readme) +- Previous: [0. How to use this guide](../0-how-to-use-guide/README.md) +- Next: [2. Setting up your own environment](../2-setting-up-environment/README.md) diff --git a/1-quickstart/build_visiontransformer_sqsh.sh b/1-quickstart/build_visiontransformer_sqsh.sh new file mode 100755 index 0000000..77bc2c9 --- /dev/null +++ b/1-quickstart/build_visiontransformer_sqsh.sh @@ -0,0 +1,29 @@ +#!/bin/bash +set -euo pipefail + +module use /appl/local/containers/ai-modules +module load singularity-AI-bindings + +source ../env.sh + +: "${CONTAINER:?Set CONTAINER in env.sh}" + +OUT_SQSH="../resources/visiontransformer-env.sqsh" +BUILD_DIR="$SCRATCH_ROOT/visiontransformer-env-build" + +mkdir -p ../resources +rm -rf "$BUILD_DIR" +mkdir -p "$BUILD_DIR" + +echo "Building extension environment in: $BUILD_DIR" +singularity exec -B "$BUILD_DIR":/user-software "$CONTAINER" bash -c ' +set -euo pipefail +python -m venv /user-software --system-site-packages +/user-software/bin/python -m pip install h5py lmdb msgpack six tqdm mlflow==2.22.0 +' + +rm -f "$OUT_SQSH" +echo "Creating squashfs image: $OUT_SQSH" +mksquashfs "$BUILD_DIR" "$OUT_SQSH" -noappend -comp gzip -no-xattrs + +echo "Done: $OUT_SQSH" diff --git a/1-quickstart/run_base.sh b/1-quickstart/run_base.sh new file mode 100755 index 0000000..71ea345 --- /dev/null +++ b/1-quickstart/run_base.sh @@ -0,0 +1,34 @@ +#!/bin/bash + +#SBATCH --job-name=quickstart-base +#SBATCH --account=project_462000131 +#SBATCH --partition=small-g + +#SBATCH 
--nodes=1 +#SBATCH --gpus-per-node=1 +#SBATCH --ntasks-per-node=1 +#SBATCH --cpus-per-task=7 +#SBATCH --mem-per-gpu=60G + +#SBATCH --time=00:15:00 + +module use /appl/local/containers/ai-modules +module load singularity-AI-bindings + +source ../env.sh + +: "${CONTAINER:?Set CONTAINER in env.sh}" + +singularity exec "$CONTAINER" python - <<'PY' +import platform +import torch + +print(f"python={platform.python_version()}") +print(f"torch={torch.__version__}") +print(f"rocm={torch.version.hip}") +print(f"cuda_available={torch.cuda.is_available()}") +print(f"device_count={torch.cuda.device_count() if torch.cuda.is_available() else 0}") + +assert torch.cuda.is_available(), "GPU not visible in container" +print("SMOKE TEST PASSED") +PY diff --git a/1-quickstart/run_vit.sh b/1-quickstart/run_vit.sh new file mode 100644 index 0000000..742e561 --- /dev/null +++ b/1-quickstart/run_vit.sh @@ -0,0 +1,23 @@ +#!/bin/bash + +#SBATCH --job-name=quickstart-vit +#SBATCH --account=project_462000131 +#SBATCH --partition=small-g + +#SBATCH --nodes=1 +#SBATCH --gpus-per-node=1 +#SBATCH --ntasks-per-node=1 +#SBATCH --cpus-per-task=7 +#SBATCH --mem-per-gpu=60G + +#SBATCH --time=01:00:00 + +set -euo pipefail + +module use /appl/local/containers/ai-modules +module load singularity-AI-bindings + +source ../env.sh + +: "${CONTAINER:?Set CONTAINER in env.sh}" +singularity exec -B ../resources/visiontransformer-env.sqsh:/user-software:image-src=/ "$CONTAINER" /user-software/bin/python visiontransformer.py diff --git a/README.md b/README.md index 2045a8c..623177e 100644 --- a/README.md +++ b/README.md @@ -1,39 +1,47 @@ -# LUMI AI guide +# LUMI AI Guide -This guide is designed to assist users in migrating their machine learning applications from smaller-scale computing environments to the LUMI supercomputer. 
We will walk you through a detailed example of training an image classification model using [PyTorch's Vision Transformer (VIT)](https://pytorch.org/vision/main/models/vision_transformer.html) on the [ImageNet dataset](https://www.image-net.org/). +This repository is a practical guide for moving machine learning training workloads to LUMI using a runnable Vision Transformer example in PyTorch. -All Python and bash scripts referenced in this guide are accessible in this [GitHub repository](https://github.com/Lumi-supercomputer/LUMI-AI-example/tree/main). We start with a basic python script, [visiontransformer.py](1-quickstart/visiontransformer.py), that could run on your local machine and modify it over the next chapters to run it efficiently on LUMI. +All Python and shell scripts referenced in this guide are part of this repository: [LUMI-AI-Guide](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main). The workflow starts from [`1-quickstart/visiontransformer.py`](1-quickstart/visiontransformer.py) and scales up chapter by chapter. -Even though this guide uses PyTorch, most of the covered topics are independent of the used machine learning framework. We therefore believe this guide is helpful for all new ML users on LUMI while also providing a concrete example that runs on LUMI. +## Goal -> [!IMPORTANT] -> PyTorch containers on LUMI will in the future be provided by the [LUMI AI Factory](lumi-ai-factory.eu). -> This guide, apart from [3. File formats for training data](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/3-file-formats#readme), was updated to utilize these new containers. The containers currently referenced in that part remain available on LUMI, but will no longer receive updates. However, all examples included in this guide will continue to work as they currently do. 
-> For more information about the new containers, refer to the [LUMI AI Factory AI Software Environment documentation](https://docs.lumi-supercomputer.eu/laif/software/ai-environment/). +Provide a clear core path from first single-GPU execution to distributed training, then optional experiment-tracking extensions on LUMI. -### Requirements +## Requirements -Before proceeding, please ensure you meet the following prerequisites: +Before starting, ensure you have: -* A basic understanding of machine learning concepts and Python programming. This guide will focus primarily on aspects specific to training models on LUMI. -* An active user account on LUMI and familiarity with its basic operations. -* If you wish to run the included examples, you need to be part of a project with GPU hours on LUMI. +- basic familiarity with Python and machine learning workflows +- a LUMI user account and basic command-line/Slurm usage +- a project with available GPU hours if you want to run the examples -### Table of contents +## Important account note -The guide is structured into the following sections: +This guide keeps `#SBATCH --account=...` lines inside job scripts for simple copy-run commands. +Before submitting jobs, replace those account values with your own LUMI project account. -- [1. QuickStart](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/1-quickstart#readme) -- [2. Setting up your own environment](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/2-setting-up-environment#readme) -- [3. File formats for training data](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/3-file-formats#readme) -- [4. Data Storage Options](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/4-data-storage#readme) -- [5. Multi-GPU and Multi-Node Training](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/5-multi-gpu-and-node#readme) -- [6. 
Monitoring and Profiling jobs](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/6-monitoring-and-profiling#readme) -- [7. TensorBoard visualization](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/7-TensorBoard-visualization#readme) -- [8. MLflow visualization](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/8-MLflow-visualization#readme) -- [9. Wandb visualization](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/9-Wandb-visualization#readme) - -### Further reading +## Start here + +Begin with [0. How to use this guide](0-how-to-use-guide/README.md). Then follow core lessons `1` to `6`, and continue with optional lessons `7` to `9` if needed. + +## Core lessons (recommended order) + +- [0. How to use this guide](0-how-to-use-guide/README.md) +- [1. QuickStart](1-quickstart/README.md) +- [2. Setting up your own environment](2-setting-up-environment/README.md) +- [3. File formats for training data](3-file-formats/README.md) +- [4. Data Storage Options](4-data-storage/README.md) +- [5. Multi-GPU and Multi-Node Training](5-multi-gpu-and-node/README.md) +- [6. Monitoring and Profiling jobs](6-monitoring-and-profiling/README.md) + +## Optional lessons (experiment tracking) + +- [7. TensorBoard visualization](7-tensorboard-visualization/README.md) +- [8. MLflow visualization](8-mlflow-visualization/README.md) +- [9. 
 W&B visualization](9-wandb-visualization/README.md) + +## Further reading - [LUMI Documentation](https://docs.lumi-supercomputer.eu/) - [LUMI AI Factory Services](https://docs.lumi-supercomputer.eu/software/local/lumi-aif/) @@ -41,3 +49,7 @@ The guide is structured into the following sections: - [LUMI software library, TensorFlow](https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/t/TensorFlow/) - [LUMI software library, Jax](https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/j/jax/) - [Workshop material - Moving your AI training jobs to LUMI](https://lumi-supercomputer.github.io/LUMI-training-materials/ai-20240529/) + +## Navigation + +- Next: [0. How to use this guide](0-how-to-use-guide/README.md) diff --git a/env.sh b/env.sh new file mode 100644 index 0000000..dd12ce3 --- /dev/null +++ b/env.sh @@ -0,0 +1,24 @@ +#!/bin/bash + +# Shared configuration for LUMI runs across the repo. +# Override any value by exporting it before sourcing this file. + +export REPO_ROOT="${REPO_ROOT:-/project/project_462000131/anisrahm/LUMI-AI-Guide}" + +export PROJECT_ACCOUNT="${PROJECT_ACCOUNT:-project_462000131}" +export LUMI_USER="${LUMI_USER:-${USER:-anisrahm}}" +export CONTAINER="${CONTAINER:-/appl/local/laifs/containers/lumi-multitorch-u24r64f21m43t29-20260225_144743/lumi-multitorch-full-u24r64f21m43t29-20260225_144743.sif}" + +export PROJECT_ROOT="${PROJECT_ROOT:-/project/${PROJECT_ACCOUNT}/${LUMI_USER}}" +export SCRATCH_ROOT="${SCRATCH_ROOT:-/scratch/${PROJECT_ACCOUNT}/${LUMI_USER}}" + +export DATA_PROJECT_DIR="${DATA_PROJECT_DIR:-${SCRATCH_ROOT}/file-format-ai-benchmark}" +export DATA_BENCH_DIR="${DATA_BENCH_DIR:-${SCRATCH_ROOT}/file-format-ai-benchmark}" + +export SQUASH_LARGE="${SQUASH_LARGE:-${DATA_BENCH_DIR}/imagenet-object-localization-challenge.squashfs}" +export LMDB_LARGE="${LMDB_LARGE:-${DATA_BENCH_DIR}/imagenet-object-localization-challenge.lmdb}" + +export TINY_LMDB_PATH="${TINY_LMDB_PATH:-${DATA_PROJECT_DIR}/data-formats/lmdb/train_images}" +export 
TINY_HDF5_PATH="${TINY_HDF5_PATH:-${DATA_PROJECT_DIR}/data-formats/hdf5/train_images.hdf5}" + +export IMAGENET_ZIP_DIR="${IMAGENET_ZIP_DIR:-${DATA_BENCH_DIR}/}"
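The defaults in `env.sh` rely on the shell's `${VAR:-default}` expansion: a value exported before sourcing wins, otherwise the default applies. The following is a minimal sketch of that behaviour, using a throwaway stand-in for `env.sh`. The account names and the `demo-user` path component are placeholders made up for illustration.

```bash
#!/bin/bash
# Stand-in for env.sh with two of its variables; the defaults are placeholders.
demo_env=$(mktemp)
cat > "$demo_env" <<'EOF'
export PROJECT_ACCOUNT="${PROJECT_ACCOUNT:-project_000000000}"
export SCRATCH_ROOT="${SCRATCH_ROOT:-/scratch/${PROJECT_ACCOUNT}/demo-user}"
EOF

# Case 1: nothing exported beforehand, so the defaults apply.
default_root=$(unset PROJECT_ACCOUNT SCRATCH_ROOT; source "$demo_env"; echo "$SCRATCH_ROOT")

# Case 2: export an override first; defaults that depend on it pick it up.
override_root=$(unset SCRATCH_ROOT; export PROJECT_ACCOUNT=project_111111111; source "$demo_env"; echo "$SCRATCH_ROOT")

echo "default:  $default_root"
echo "override: $override_root"
rm -f "$demo_env"
```

Because later variables are built from earlier ones, overriding `PROJECT_ACCOUNT` alone is enough to move every derived path. This only holds for variables that have not themselves been exported already.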