67 changes: 67 additions & 0 deletions 0-how-to-use-guide/README.md
@@ -0,0 +1,67 @@
# 0. How to use this guide

## Goal

Start with a clear plan for running this guide on LUMI, including prerequisites, expected outcomes, and recommended lesson order.

## Assumptions

- You are new to this guide and want a reliable first path through it.
- You can access LUMI and have a project account with GPU allocation.
- You are comfortable with basic Linux shell commands.

## Who this guide is for

This guide is intended for users who want to move model training from smaller environments to LUMI with practical, runnable examples.

## Prerequisites

Before starting lesson 1, ensure you have:

- a LUMI user account
- a project with GPU hours
- basic familiarity with Python and Slurm
- a working clone of this repository in `/project` or `/scratch`

## Recommended learning path

Follow this order for the core workflow (`1` to `6`):

1. [1. QuickStart](../1-quickstart/README.md)
2. [2. Setting up your own environment](../2-setting-up-environment/README.md)
3. [3. File formats for training data](../3-file-formats/README.md)
4. [4. Data Storage Options](../4-data-storage/README.md)
5. [5. Multi-GPU and Multi-Node Training](../5-multi-gpu-and-node/README.md)
6. [6. Monitoring and Profiling jobs](../6-monitoring-and-profiling/README.md)

After lesson `6`, the core path is complete. Continue with optional experiment-tracking modules as needed:

- [7. TensorBoard visualization](../7-tensorboard-visualization/README.md)
- [8. MLflow visualization](../8-mlflow-visualization/README.md)
- [9. W&B visualization](../9-wandb-visualization/README.md)

## Baseline conventions used across lessons

- Keep a single stable container path in `env.sh`.
- Keep datasets in a consistent location under `resources/` unless a chapter explicitly changes it.
- Run batch jobs from the chapter directory so relative paths in scripts resolve correctly.

## What success looks like

By the end of the core workflow, you should be able to:

- run a single-GPU training job
- scale to multi-GPU and multi-node runs
- choose storage and data formats intentionally
- profile runs and interpret bottlenecks

## Troubleshooting

- Job submission fails immediately: verify `--account` and project permissions.
- Container-related errors: confirm `env.sh` points to an existing `.sif`.
- Missing files at runtime: run commands from the intended chapter directory.
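The container check in the list above can be scripted. The following is a minimal sketch, assuming `CONTAINER` has been exported into the environment (for example by sourcing `env.sh` and exporting the variable in the calling shell). The function name `check_container` is hypothetical and not part of this repository.

```python
import os
from pathlib import Path


def check_container(env=None):
    """Return a list of problems with the CONTAINER setting (empty if none)."""
    # Assumption: CONTAINER is visible as an exported environment variable.
    env = os.environ if env is None else env
    value = env.get("CONTAINER", "")
    if not value:
        return ["CONTAINER is not set; set it in env.sh"]
    problems = []
    path = Path(value)
    if path.suffix != ".sif":
        problems.append(f"{value} does not end in .sif")
    if not path.is_file():
        problems.append(f"{value} does not exist or is not a file")
    return problems


if __name__ == "__main__":
    for problem in check_container():
        print(f"problem: {problem}")
```

An empty result means the variable is set and points to an existing `.sif` file. It does not confirm that the container itself runs correctly.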

## Navigation

- Previous: [Guide Home](../README.md)
- Next: [1. QuickStart](../1-quickstart/README.md)
156 changes: 113 additions & 43 deletions 1-quickstart/README.md
@@ -1,84 +1,154 @@
# 1. QuickStart

This chapter covers how to set up the environment to run the [`visiontransformer.py`](visiontransformer.py) script on LUMI.
This lesson gives you a minimal, end-to-end first run of [`visiontransformer.py`](visiontransformer.py) on LUMI.

First, you clone this repository to LUMI via the following command:
## Goal

Run one single-GPU training job on LUMI and confirm that:

- the container works on GPU
- the squashfs extension is available
- training logs and a model checkpoint are produced

## Assumptions

- You have a LUMI project account and can submit jobs to `small-g`.
- The repository is cloned on `/project` or `/scratch` (not `$HOME`).
- `../env.sh` is configured for your environment and points to a valid container via `CONTAINER`.

## Working directory

Run commands in this chapter from:

```bash
cd /path/to/LUMI-AI-Guide/1-quickstart
```

## Minimal run checkpoint

Command:

```bash
sbatch run_base.sh
```

Success signal:

- Job output contains `SMOKE TEST PASSED`.

Clone this repository to LUMI if needed:

```bash
git clone https://github.com/Lumi-supercomputer/LUMI-AI-Guide.git
```

We recommend using your `/project/` or `/scratch/` directory of your project to clone the repository as your home directory (`$HOME`) has a capacity of 20 GB and is intended to store user configuration files and personal data.
Use `/project` or `/scratch` for this clone. `$HOME` has limited capacity and is meant for configuration and personal files.
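To confirm that a clone does not live under `$HOME`, a small check like the one below may help. This is a sketch only: the function name `clone_location_ok` is hypothetical, and it tests only that the directory is outside the home directory, not that it sits on a particular project or scratch filesystem.

```python
import os
from pathlib import Path


def clone_location_ok(repo_dir, home=None):
    """Return True if repo_dir is not the home directory or inside it."""
    home = Path(home if home is not None else os.path.expanduser("~")).resolve()
    repo = Path(repo_dir).resolve()
    return repo != home and home not in repo.parents


if __name__ == "__main__":
    # Check the current working directory against the current user's home.
    print("location ok" if clone_location_ok(".") else "clone is under $HOME")
```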

Next, navigate to the `LUMI-AI-Guide/quickstart` directory:
Then move to the lesson directory:

```bash
cd LUMI-AI-Guide/1-quickstart
```

We now need to setup the environment if we wish to run the included python scripts. We will use one of the provided PyTorch containers that we extend with additional packages (this step will be explained in more detail in the next chapter [Setting up your own environment](../2-setting-up-environment/README.md)). The fastest way to achieve this is to use the provided script `set_up_environment.sh`:
The recommended quickstart flow has three steps:

1. Smoke-test the base container.
2. Build the squashfs extension.
3. Run the Vision Transformer training script with that extension.

This keeps the runtime model consistent with the rest of the guide.

## Step 1: Smoke test the base container

Submit the base-container smoke test:

```bash
./set_up_environment.sh
sbatch run_base.sh
```

If you receive a permission denied error, you can make the script executable by running:
Check the job output in:

```bash
chmod +x set_up_environment.sh
slurm-<jobid>.out
```

After the script has finished, you will see now two new files in the `LUMI-AI-Guide/resources/` directory. One is the training dataset in a `hdf5` file format (`train_images.hdf5`). The other one is the virtual python enviroment in a `sqfs` file format (`ai-guide-env.sqsh`).
The output file is written to the directory where you run `sbatch` (here: `1-quickstart/`).

For this example, we use the [Tiny ImageNet Dataset](https://paperswithcode.com/dataset/tiny-imagenet) which is already transformed into the file system friendly hdf5 format (Chapter [File formats for training data](../3-file-formats/README.md) explains in detail why this step is necessary). Please have a look at the terms of access for the ImageNet Dataset [here](https://www.image-net.org/download.php).
You should see Python/Torch/ROCm version info and `SMOKE TEST PASSED`.
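Instead of reading the output file by hand, you can search it for the success marker. The sketch below assumes the default `slurm-<jobid>.out` naming described above. The helper names `latest_slurm_output` and `smoke_test_passed` are illustrative and not part of this repository.

```python
from pathlib import Path

MARKER = "SMOKE TEST PASSED"


def latest_slurm_output(directory="."):
    """Return the most recently modified slurm-*.out file, or None."""
    candidates = sorted(
        Path(directory).glob("slurm-*.out"),
        key=lambda p: p.stat().st_mtime,
    )
    return candidates[-1] if candidates else None


def smoke_test_passed(out_file):
    """Return True if the Slurm output file contains the success marker."""
    return MARKER in Path(out_file).read_text(errors="replace")


if __name__ == "__main__":
    out = latest_slurm_output()
    if out is None:
        print("no slurm-*.out file found in this directory")
    else:
        status = "passed" if smoke_test_passed(out) else "marker not found"
        print(f"{out}: {status}")
```

Run it from `1-quickstart/` after the job has finished. A missing marker can also mean the job is still queued or running.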

To run the Vision Transformer example, we need to use a batch job script. We provide a batch job script [`run.sh`](run.sh) that you can use to run the [`visiontransformer.py`](visiontransformer.py) script on a single GPU on a LUMI-G node.
A quickstart to SLURM is provided in the [LUMI documentation](https://docs.lumi-supercomputer.eu/runjobs/scheduled-jobs/slurm-quickstart/).
## Step 2: Build the squashfs extension

Build `visiontransformer-env.sqsh` from the base container:

```bash
./build_visiontransformer_sqsh.sh
```

The script writes the extension image to:

- `../resources/visiontransformer-env.sqsh`

The extension includes packages needed by the sample scripts, such as `h5py`.

To run the provided script yourself, you need to replace the `--account` flag in line 2 of the [`run.sh`](run.sh) script with your own project account. You can find your project account by running the command `lumi-workspaces`.
## Step 3: Run Vision Transformer

After you have replaced the `--account` flag, you can submit the job to the LUMI scheduler by running:
Submit:

```bash
sbatch run.sh
sbatch run_vit.sh
```

Once the job starts running, a `slurm-<jobid>.out` file will be created in the `quickstart` directory. This file contains the output of the job and will be updated as the job progresses.
`run_vit.sh` uses:

- `../env.sh` for container selection
- `../resources/visiontransformer-env.sqsh` for the extended Python environment

For quickstart, `run_vit.sh` uses `torchvision.datasets.FakeData` only, so the run is independent of dataset preparation.
The reported metrics are smoke-test indicators, not meaningful model-quality numbers.

*Note that we do a `module purge` at the beginning of the script. This will cause some warnings that some modules were not unloaded or could not be reloaded.
It is safe to ignore these warnings at this point.*
To run the Vision Transformer example, we use the batch job script [`run_vit.sh`](run_vit.sh), which runs [`visiontransformer.py`](visiontransformer.py) on a single GPU on a LUMI-G node.
A Slurm quickstart is available in the [LUMI documentation](https://docs.lumi-supercomputer.eu/runjobs/scheduled-jobs/slurm-quickstart/).

If needed, replace the `--account` flag in [`run_base.sh`](run_base.sh) and [`run_vit.sh`](run_vit.sh) with your own project account. You can find your project account by running `lumi-workspaces`.
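To see which account a job script currently requests, you can read its `#SBATCH --account=` line. This is a minimal sketch, assuming the `--account=<value>` form used in `run_base.sh` and `run_vit.sh`. It does not handle `-A` or a space-separated value, and `sbatch_account` is a hypothetical helper name.

```python
import re
from pathlib import Path

# Matches lines such as: #SBATCH --account=project_462000131
ACCOUNT_RE = re.compile(r"^#SBATCH\s+--account=(\S+)", re.MULTILINE)


def sbatch_account(script_text):
    """Return the value of the first '#SBATCH --account=' line, or None."""
    match = ACCOUNT_RE.search(script_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    for name in ("run_base.sh", "run_vit.sh"):
        path = Path(name)
        if path.is_file():
            print(f"{name}: account={sbatch_account(path.read_text())}")
        else:
            print(f"{name}: not found in the current directory")
```

Compare the printed value with the project shown by `lumi-workspaces` before submitting.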

Once the job starts, output is written to:

```bash
slurm-<jobid>.out
```

As in Step 1, the file is written to the submit directory (`1-quickstart/`) because the script uses Slurm's default output behavior.

The output will show Training Loss and Validation Accuracy values for each epoch, similar to the following:
The output will show smoke-test metrics for each epoch, similar to the following:

```bash
Training for 4 epochs in total and then saving trained model.
Quickstart mode: using FakeData with random labels.
Metrics are only for smoke testing runtime, not model quality.
Starting epoch 1.
Epoch 1, Training Loss: 4.825212168216705
Validation Accuracy: 8.95%
Epoch 1, Smoke Loss: 5.82
Smoke Accuracy (chance~0.5%): 0.49%
Starting epoch 2.
Epoch 2, Training Loss: 4.165826177024841
Validation Accuracy: 14.41%
Starting epoch 3.
Epoch 3, Training Loss: 3.792399849319458
Validation Accuracy: 18.18%
Starting epoch 4.
Epoch 4, Training Loss: 3.558482360649109
Validation Accuracy: 20.945%
Saving model to vit_b_16_imagenet.pth
Epoch 2, Smoke Loss: 5.46
Smoke Accuracy (chance~0.5%): 0.24%
...
```
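The smoke metrics can be compared against what random guessing would give. The sketch below rests on two assumptions that the source does not state: that the model has 200 output classes (inferred from `chance~0.5%` in the sample output), and that the loss is a cross-entropy computed with the natural logarithm. The function name `chance_baseline` is illustrative.

```python
import math


def chance_baseline(num_classes):
    """Return (accuracy in percent, cross-entropy loss) for uniform random guessing."""
    accuracy_pct = 100.0 / num_classes
    loss = math.log(num_classes)  # equals -ln(1 / num_classes)
    return accuracy_pct, loss


# Assumption: 200 classes, inferred from "chance~0.5%" in the sample output.
accuracy_pct, loss = chance_baseline(200)
print(f"chance accuracy: {accuracy_pct:.2f}%, chance loss: {loss:.2f}")
# prints: chance accuracy: 0.50%, chance loss: 5.30
```

Under these assumptions, losses slightly above 5.30 and accuracies around 0.5% are what you would expect from training on random labels, which is why the numbers say nothing about model quality.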

Congratulations! You have run your first training job on LUMI. The next chapter [Setting up your own environment](../2-setting-up-environment/README.md) will explain in more detail how the environment was set up and how you can set up your own environment for your AI projects on LUMI.
## Verify

After the three steps, confirm all of the following:

- Base smoke-test output includes `SMOKE TEST PASSED`.
- `../resources/visiontransformer-env.sqsh` exists.
- Training job output includes epoch-level loss and accuracy logs.
- `vit_b_16_imagenet.pth` is created in `1-quickstart/`.
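The file checks in this list can be automated. The following sketch assumes it is run from `1-quickstart/` and uses the two file paths named above. The helper `missing_artifacts` is hypothetical and not part of this repository.

```python
from pathlib import Path

# Paths are relative to 1-quickstart/, as listed in the Verify section.
EXPECTED = [
    Path("../resources/visiontransformer-env.sqsh"),
    Path("vit_b_16_imagenet.pth"),
]


def missing_artifacts(base_dir=".", expected=EXPECTED):
    """Return the expected files that are missing or empty under base_dir."""
    base = Path(base_dir)
    missing = []
    for rel in expected:
        target = base / rel
        if not target.is_file() or target.stat().st_size == 0:
            missing.append(rel)
    return missing


if __name__ == "__main__":
    problems = missing_artifacts()
    if problems:
        for rel in problems:
            print(f"missing or empty: {rel}")
    else:
        print("all expected files are present")
```

This confirms only that the files exist and are not empty. The log-based checks in the list still need the job output.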

## Troubleshooting

- Job fails at submission: find your project account with `lumi-workspaces`, then update `--account` in [`run_base.sh`](run_base.sh) and [`run_vit.sh`](run_vit.sh).
- Container variable error (`Set CONTAINER in env.sh`): set `CONTAINER` in `../env.sh` to a valid `.sif`.
- Quickstart data: this lesson uses `FakeData`; no dataset files are required.
- No GPU visible in smoke test: ensure `module load singularity-AI-bindings` is present and rerun.

### Table of contents
## Navigation

- [Home](..#readme)
- [1. QuickStart](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/1-quickstart#readme)
- [2. Setting up your own environment](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/2-setting-up-environment#readme)
- [3. File formats for training data](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/3-file-formats#readme)
- [4. Data Storage Options](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/4-data-storage#readme)
- [5. Multi-GPU and Multi-Node Training](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/5-multi-gpu-and-node#readme)
- [6. Monitoring and Profiling jobs](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/6-monitoring-and-profiling#readme)
- [7. TensorBoard visualization](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/7-TensorBoard-visualization#readme)
- [8. MLflow visualization](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/8-MLflow-visualization#readme)
- [9. Wandb visualization](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/9-Wandb-visualization#readme)
- Previous: [0. How to use this guide](../0-how-to-use-guide/README.md)
- Next: [2. Setting up your own environment](../2-setting-up-environment/README.md)
29 changes: 29 additions & 0 deletions 1-quickstart/build_visiontransformer_sqsh.sh
@@ -0,0 +1,29 @@
#!/bin/bash
set -euo pipefail

module use /appl/local/containers/ai-modules
module load singularity-AI-bindings

source ../env.sh

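: "${CONTAINER:?Set CONTAINER in env.sh}"
# Fail early if SCRATCH_ROOT is unset or empty, so BUILD_DIR never resolves to a path under "/".
# Assumption: SCRATCH_ROOT is meant to be defined in env.sh.
: "${SCRATCH_ROOT:?Set SCRATCH_ROOT (for example in env.sh)}"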

OUT_SQSH="../resources/visiontransformer-env.sqsh"
BUILD_DIR="$SCRATCH_ROOT/visiontransformer-env-build"

mkdir -p ../resources
rm -rf "$BUILD_DIR"
mkdir -p "$BUILD_DIR"

echo "Building extension environment in: $BUILD_DIR"
singularity exec -B "$BUILD_DIR":/user-software "$CONTAINER" bash -c '
set -euo pipefail
python -m venv /user-software --system-site-packages
/user-software/bin/python -m pip install h5py lmdb msgpack six tqdm mlflow==2.22.0
'

rm -f "$OUT_SQSH"
echo "Creating squashfs image: $OUT_SQSH"
mksquashfs "$BUILD_DIR" "$OUT_SQSH" -noappend -comp gzip -no-xattrs

echo "Done: $OUT_SQSH"
34 changes: 34 additions & 0 deletions 1-quickstart/run_base.sh
@@ -0,0 +1,34 @@
#!/bin/bash

#SBATCH --job-name=quickstart-base
#SBATCH --account=project_462000131
#SBATCH --partition=small-g

#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=7
#SBATCH --mem-per-gpu=60G

#SBATCH --time=00:15:00

module use /appl/local/containers/ai-modules
module load singularity-AI-bindings

source ../env.sh

: "${CONTAINER:?Set CONTAINER in env.sh}"

singularity exec "$CONTAINER" python - <<'PY'
import platform
import torch

print(f"python={platform.python_version()}")
print(f"torch={torch.__version__}")
print(f"rocm={torch.version.hip}")
print(f"cuda_available={torch.cuda.is_available()}")
print(f"device_count={torch.cuda.device_count() if torch.cuda.is_available() else 0}")

assert torch.cuda.is_available(), "GPU not visible in container"
print("SMOKE TEST PASSED")
PY
23 changes: 23 additions & 0 deletions 1-quickstart/run_vit.sh
@@ -0,0 +1,23 @@
#!/bin/bash

#SBATCH --job-name=quickstart-vit
#SBATCH --account=project_462000131
#SBATCH --partition=small-g

#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=7
#SBATCH --mem-per-gpu=60G

#SBATCH --time=01:00:00

set -euo pipefail

module use /appl/local/containers/ai-modules
module load singularity-AI-bindings

source ../env.sh

: "${CONTAINER:?Set CONTAINER in env.sh}"
singularity exec -B ../resources/visiontransformer-env.sqsh:/user-software:image-src=/ "$CONTAINER" /user-software/bin/python visiontransformer.py