Lumi-supercomputer · gregordecristoforo · Mar 16, 2026 · Apr 30, 2026
diff --git a/0-how-to-use-guide/README.md b/0-how-to-use-guide/README.md
@@ -0,0 +1,67 @@
+# 0. How to use this guide
+
+## Goal
+
+Start with a clear plan for running this guide on LUMI, including prerequisites, expected outcomes, and recommended lesson order.
+
+## Assumptions
+
+- You are new to this guide and want a reliable first path through it.
+- You can access LUMI and have a project account with GPU allocation.
+- You are comfortable with basic Linux shell commands.
+
+## Who this guide is for
+
+This guide is intended for users who want to move model training from smaller environments to LUMI with practical, runnable examples.
+
+## Prerequisites
+
+Before starting lesson 1, ensure you have:
+
+- a LUMI user account
+- a project with GPU hours
+- basic familiarity with Python and Slurm
+- a working clone of this repository in `/project` or `/scratch`
+
+## Recommended learning path
+
+Follow this order for the core workflow (`1` to `6`):
+
+1. [1. QuickStart](../1-quickstart/README.md)
+2. [2. Setting up your own environment](../2-setting-up-environment/README.md)
+3. [3. File formats for training data](../3-file-formats/README.md)
+4. [4. Data Storage Options](../4-data-storage/README.md)
+5. [5. Multi-GPU and Multi-Node Training](../5-multi-gpu-and-node/README.md)
+6. [6. Monitoring and Profiling jobs](../6-monitoring-and-profiling/README.md)
+
+After lesson `6`, the core path is complete. Continue with optional experiment-tracking modules as needed:
+
+- [7. TensorBoard visualization](../7-tensorboard-visualization/README.md)
+- [8. MLflow visualization](../8-mlflow-visualization/README.md)
+- [9. W&B visualization](../9-wandb-visualization/README.md)
+
+## Baseline conventions used across lessons
+
+- Keep a single stable container path in `env.sh`.
+- Keep datasets in a consistent location under `resources/` unless a chapter explicitly changes it.
+- Run batch jobs from the chapter directory so relative paths in scripts resolve correctly.
+
+## What success looks like
+
+By the end of the core workflow, you should be able to:
+
+- run a single-GPU training job
+- scale to multi-GPU and multi-node runs
+- choose storage and data formats intentionally
+- profile runs and interpret bottlenecks
+
+## Troubleshooting
+
+- Job submission fails immediately: verify `--account` and project permissions.
+- Container-related errors: confirm `env.sh` points to an existing `.sif`.
+- Missing files at runtime: run commands from the intended chapter directory.
+
+## Navigation
+
+- Previous: [Guide Home](../README.md)
+- Next: [1. QuickStart](../1-quickstart/README.md)
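Review note (not part of the patch): the prerequisite that the clone lives in `/project` or `/scratch` could be checked before the first submission. The sketch below is one possible way to do that. The function name `clone_location_ok` is hypothetical, and the check only compares path prefixes, so it assumes the clone is referenced by its `/project/...` or `/scratch/...` path and does not resolve symlinks.

```bash
#!/bin/bash
# Hypothetical helper, not part of the repository.
# Returns 0 when the given path starts with /project/ or /scratch/, 1 otherwise.
clone_location_ok() {
    case "$1" in
        /project/*|/scratch/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: warn when the current directory is somewhere else (for example $HOME).
if ! clone_location_ok "$PWD/"; then
    echo "Warning: $PWD is not under /project or /scratch" >&2
fi
```

A prefix check is deliberately simple; it only catches the common mistake of cloning into `$HOME`.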
diff --git a/1-quickstart/README.md b/1-quickstart/README.md
@@ -1,84 +1,154 @@
 # 1. QuickStart
 
-This chapter covers how to set up the environment to run the [`visiontransformer.py`](visiontransformer.py) script on LUMI. 
+This lesson gives you a minimal, end-to-end first run of [`visiontransformer.py`](visiontransformer.py) on LUMI.
 
-First, you clone this repository to LUMI via the following command:
+## Goal
+
+Run one single-GPU training job on LUMI and confirm that:
+
+- the container works on GPU
+- the squashfs extension is available
+- training logs and a model checkpoint are produced
+
+## Assumptions
+
+- You have a LUMI project account and can submit jobs to `small-g`.
+- The repository is cloned on `/project` or `/scratch` (not `$HOME`).
+- `../env.sh` is configured for your environment and points to a valid container via `CONTAINER`.
+
+## Working directory
+
+Run commands in this chapter from:
+
+```bash
+cd /path/to/LUMI-AI-Guide/1-quickstart
+```
+
+## Minimal run checkpoint
+
+Command:
+
+```bash
+sbatch run_base.sh
+```
+
+Success signal:
+
+- Job output contains `SMOKE TEST PASSED`.
+
+Clone this repository to LUMI if needed:
 
 ```bash
 git clone https://github.com/Lumi-supercomputer/LUMI-AI-Guide.git
 ```
 
-We recommend using your `/project/` or `/scratch/` directory of your project to clone the repository as your home directory (`$HOME`) has a capacity of 20 GB and is intended to store user configuration files and personal data.
+Use `/project` or `/scratch` for this clone. `$HOME` has limited capacity and is meant for configuration and personal files.
 
-Next, navigate to the `LUMI-AI-Guide/quickstart` directory:
+Then move to the lesson directory:
 
 ```bash
 cd LUMI-AI-Guide/1-quickstart
 ```
 
-We now need to setup the environment if we wish to run the included python scripts. We will use one of the provided PyTorch containers that we extend with additional packages (this step will be explained in more detail in the next chapter [Setting up your own environment](../2-setting-up-environment/README.md)). The fastest way to achieve this is to use the provided script `set_up_environment.sh`:
+The recommended quickstart flow has three steps:
+
+1. Smoke-test the base container.
+2. Build the squashfs extension.
+3. Run the Vision Transformer training script with that extension.
+
+This keeps the runtime model consistent with the rest of the guide.
+
+## Step 1: Smoke test the base container
+
+Submit the base-container smoke test:
 
 ```bash
-./set_up_environment.sh
+sbatch run_base.sh
 ```
 
-If you receive a permission denied error, you can make the script executable by running:
+Check the job output in:
 
 ```bash
-chmod +x set_up_environment.sh
+slurm-<jobid>.out
 ```
 
-After the script has finished, you will see now two new files in the `LUMI-AI-Guide/resources/` directory. One is the training dataset in a `hdf5` file format (`train_images.hdf5`). The other one is the virtual python enviroment in a `sqfs` file format (`ai-guide-env.sqsh`).
+The output file is written to the directory where you run `sbatch` (here: `1-quickstart/`).
 
-For this example, we use the [Tiny ImageNet Dataset](https://paperswithcode.com/dataset/tiny-imagenet) which is already transformed into the file system friendly hdf5 format (Chapter [File formats for training data](../3-file-formats/README.md) explains in detail why this step is necessary). Please have a look at the terms of access for the ImageNet Dataset [here](https://www.image-net.org/download.php).
+You should see Python/Torch/ROCm version info and `SMOKE TEST PASSED`.
 
-To run the Vision Transformer example, we need to use a batch job script. We provide a batch job script [`run.sh`](run.sh) that you can use to run the [`visiontransformer.py`](visiontransformer.py) script on a single GPU on a LUMI-G node. 
-A quickstart to SLURM is provided in the [LUMI documentation](https://docs.lumi-supercomputer.eu/runjobs/scheduled-jobs/slurm-quickstart/). 
+## Step 2: Build the squashfs extension
+
+Build `visiontransformer-env.sqsh` from the base container:
+
+```bash
+./build_visiontransformer_sqsh.sh
+```
+
+This creates `../resources/visiontransformer-env.sqsh`.
+
+The extension includes packages needed by the sample scripts, such as `h5py`.
 
-To run the provided script yourself, you need to replace the `--account` flag in line 2 of the [`run.sh`](run.sh) script with your own project account. You can find your project account by running the command `lumi-workspaces`.
+## Step 3: Run Vision Transformer
 
-After you have replaced the `--account` flag, you can submit the job to the LUMI scheduler by running:
+Submit:
 
 ```bash
-sbatch run.sh
+sbatch run_vit.sh
 ```
 
-Once the job starts running, a `slurm-<jobid>.out` file will be created in the `quickstart` directory. This file contains the output of the job and will be updated as the job progresses.
+`run_vit.sh` uses:
+
+- `../env.sh` for container selection
+- `../resources/visiontransformer-env.sqsh` for the extended Python environment
+
+For the quickstart, the job submitted by `run_vit.sh` trains on `torchvision.datasets.FakeData`, so the run does not depend on dataset preparation.
+No dataset files are required for this lesson.
+The reported metrics are smoke-test indicators, not meaningful model-quality numbers.
 
-*Note that we do a `module purge` at the beginning of the script. This will cause some warnings that some modules were not unloaded or could not be reloaded.
-It is safe to ignore these warnings at this point.*
+To run the Vision Transformer example, we use the batch job script [`run_vit.sh`](run_vit.sh), which runs [`visiontransformer.py`](visiontransformer.py) on a single GPU on a LUMI-G node.
+A Slurm quickstart is available in the [LUMI documentation](https://docs.lumi-supercomputer.eu/runjobs/scheduled-jobs/slurm-quickstart/).
+
+If needed, replace the value of `--account` in [`run_base.sh`](run_base.sh) and [`run_vit.sh`](run_vit.sh) with your own project account. You can find your project account by running `lumi-workspaces`.
+
+Once the job starts, output is written to:
+
+```bash
+slurm-<jobid>.out
+```
 
+As in Step 1, this file is written to the submit directory (`1-quickstart/`) because the script uses Slurm's default output behavior.
 
-The output will show Training Loss and Validation Accuracy values for each epoch, similar to the following:
+The output will show smoke-test metrics for each epoch, similar to the following:
 
 ```bash
-Training for 4 epochs in total and then saving trained model.
+Quickstart mode: using FakeData with random labels.
+Metrics are only for smoke testing runtime, not model quality.
 Starting epoch 1.
-Epoch 1, Training Loss: 4.825212168216705
-Validation Accuracy: 8.95%
+Epoch 1, Smoke Loss: 5.82
+Smoke Accuracy (chance~0.5%): 0.49%
 Starting epoch 2.
-Epoch 2, Training Loss: 4.165826177024841
-Validation Accuracy: 14.41%
-Starting epoch 3.
-Epoch 3, Training Loss: 3.792399849319458
-Validation Accuracy: 18.18%
-Starting epoch 4.
-Epoch 4, Training Loss: 3.558482360649109
-Validation Accuracy: 20.945%
-Saving model to vit_b_16_imagenet.pth
+Epoch 2, Smoke Loss: 5.46
+Smoke Accuracy (chance~0.5%): 0.24%
+...
 ```
 
-Congratulations! You have run your first training job on LUMI. The next chapter [Setting up your own environment](../2-setting-up-environment/README.md) will explain in more detail how the environment was set up and how you can set up your own environment for your AI projects on LUMI.
+## Verify
+
+After the three steps, confirm all of the following:
+
+- Base smoke-test output includes `SMOKE TEST PASSED`.
+- `../resources/visiontransformer-env.sqsh` exists.
+- Training job output includes epoch-level loss and accuracy logs.
+- `vit_b_16_imagenet.pth` is created in `1-quickstart/`.
+
+## Troubleshooting
+
+- Job fails at submission: run `lumi-workspaces` to find your project account, then update `--account` in [`run_base.sh`](run_base.sh) and [`run_vit.sh`](run_vit.sh).
+- Container variable error (`Set CONTAINER in env.sh`): set `CONTAINER` in `../env.sh` to a valid `.sif`.
+- Quickstart data: this lesson uses `FakeData`; no dataset files are required.
+- No GPU visible in smoke test: ensure `module load singularity-AI-bindings` is present in the job script and rerun.
 
- ### Table of contents
+## Navigation
 
-- [Home](..#readme)
-- [1. QuickStart](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/1-quickstart#readme)
-- [2. Setting up your own environment](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/2-setting-up-environment#readme)
-- [3. File formats for training data](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/3-file-formats#readme)
-- [4. Data Storage Options](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/4-data-storage#readme)
-- [5. Multi-GPU and Multi-Node Training](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/5-multi-gpu-and-node#readme)
-- [6. Monitoring and Profiling jobs](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/6-monitoring-and-profiling#readme)
-- [7. TensorBoard visualization](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/7-TensorBoard-visualization#readme)
-- [8. MLflow visualization](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/8-MLflow-visualization#readme)
-- [9. Wandb visualization](https://github.com/Lumi-supercomputer/LUMI-AI-Guide/tree/main/9-Wandb-visualization#readme)
+- Previous: [0. How to use this guide](../0-how-to-use-guide/README.md)
+- Next: [2. Setting up your own environment](../2-setting-up-environment/README.md)
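Review note (not part of the patch): the four items in the "Verify" list could be checked in one pass. The sketch below assumes the file names given in the README (`visiontransformer-env.sqsh`, `vit_b_16_imagenet.pth`) and the log lines shown in the sample output (`SMOKE TEST PASSED`, `Epoch <n>, ...`). The function name `verify_quickstart` and its arguments are hypothetical.

```bash
#!/bin/bash
# Hypothetical helper, not part of the repository.
# Usage: verify_quickstart <base-job-output> <training-job-output> [chapter-dir]
verify_quickstart() {
    local base_out="$1" train_out="$2" chapter_dir="${3:-.}"
    local failed=0

    # 1. The base smoke test printed its success marker.
    grep -q "SMOKE TEST PASSED" "$base_out" 2>/dev/null \
        || { echo "FAIL: no SMOKE TEST PASSED in $base_out"; failed=1; }
    # 2. The squashfs extension was built into ../resources.
    [ -f "$chapter_dir/../resources/visiontransformer-env.sqsh" ] \
        || { echo "FAIL: squashfs extension missing"; failed=1; }
    # 3. The training job logged at least one epoch line.
    grep -q "^Epoch [0-9]" "$train_out" 2>/dev/null \
        || { echo "FAIL: no epoch logs in $train_out"; failed=1; }
    # 4. The checkpoint was written to the chapter directory.
    [ -f "$chapter_dir/vit_b_16_imagenet.pth" ] \
        || { echo "FAIL: checkpoint vit_b_16_imagenet.pth missing"; failed=1; }

    [ "$failed" -eq 0 ] && echo "All quickstart checks passed"
    return "$failed"
}
```

From `1-quickstart/`, a call might look like `verify_quickstart slurm-<base-jobid>.out slurm-<vit-jobid>.out`, with the job IDs filled in by hand.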
diff --git a/1-quickstart/build_visiontransformer_sqsh.sh b/1-quickstart/build_visiontransformer_sqsh.sh
@@ -0,0 +1,29 @@
+#!/bin/bash
+set -euo pipefail
+
+module use /appl/local/containers/ai-modules
+module load singularity-AI-bindings
+
+source ../env.sh
+
+: "${CONTAINER:?Set CONTAINER in env.sh}"
+
+OUT_SQSH="../resources/visiontransformer-env.sqsh"
+BUILD_DIR="${SCRATCH_ROOT:?Set SCRATCH_ROOT in env.sh}/visiontransformer-env-build"
+
+mkdir -p ../resources
+rm -rf "$BUILD_DIR"
+mkdir -p "$BUILD_DIR"
+
+echo "Building extension environment in: $BUILD_DIR"
+singularity exec -B "$BUILD_DIR":/user-software "$CONTAINER" bash -c '
+set -euo pipefail
+python -m venv /user-software --system-site-packages
+/user-software/bin/python -m pip install h5py lmdb msgpack six tqdm mlflow==2.22.0
+'
+
+rm -f "$OUT_SQSH"
+echo "Creating squashfs image: $OUT_SQSH"
+mksquashfs "$BUILD_DIR" "$OUT_SQSH" -noappend -comp gzip -no-xattrs
+
+echo "Done: $OUT_SQSH"
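Review note (not part of the patch): the `: "${CONTAINER:?Set CONTAINER in env.sh}"` guard relies on standard shell parameter expansion, where `${VAR:?message}` makes a non-interactive shell exit with the message when `VAR` is unset or empty. The sketch below shows the three cases in subshells so that a failure does not end the calling script. `DEMO_CONTAINER` and the `.sif` path are placeholders, not names used by the guide.

```bash
#!/bin/bash
# Illustrative only: DEMO_CONTAINER is a placeholder variable, not used by the guide.

# Unset variable: the expansion fails, the subshell exits with a non-zero status,
# and the message goes to stderr (silenced here).
( unset DEMO_CONTAINER; : "${DEMO_CONTAINER:?Set DEMO_CONTAINER first}" ) 2>/dev/null
echo "unset -> exit status $?"

# Empty variable: the colon in ':?' makes an empty value fail as well.
( DEMO_CONTAINER=""; : "${DEMO_CONTAINER:?Set DEMO_CONTAINER first}" ) 2>/dev/null
echo "empty -> exit status $?"

# Non-empty variable: the guard is a no-op and the script continues.
( DEMO_CONTAINER="/path/to/container.sif"; : "${DEMO_CONTAINER:?Set DEMO_CONTAINER first}" )
echo "set   -> exit status $?"
```

Because the empty case also fails, the same pattern is useful for any variable that later feeds a path passed to `rm -rf`.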
diff --git a/1-quickstart/run_base.sh b/1-quickstart/run_base.sh
@@ -0,0 +1,34 @@
+#!/bin/bash
+
+#SBATCH --job-name=quickstart-base
+#SBATCH --account=project_462000131
+#SBATCH --partition=small-g
+
+#SBATCH --nodes=1
+#SBATCH --gpus-per-node=1
+#SBATCH --ntasks-per-node=1
+#SBATCH --cpus-per-task=7
+#SBATCH --mem-per-gpu=60G
+
+#SBATCH --time=00:15:00
+
+module use /appl/local/containers/ai-modules
+module load singularity-AI-bindings
+
+source ../env.sh
+
+: "${CONTAINER:?Set CONTAINER in env.sh}"
+
+singularity exec "$CONTAINER" python - <<'PY'
+import platform
+import torch
+
+print(f"python={platform.python_version()}")
+print(f"torch={torch.__version__}")
+print(f"rocm={torch.version.hip}")
+print(f"cuda_available={torch.cuda.is_available()}")
+print(f"device_count={torch.cuda.device_count() if torch.cuda.is_available() else 0}")
+
+assert torch.cuda.is_available(), "GPU not visible in container"
+print("SMOKE TEST PASSED")
+PY
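Review note (not part of the patch): since both job scripts use Slurm's default output name, the output file can be derived from the job ID instead of being looked up by hand. The sketch below assumes submission with `sbatch --parsable`, which prints the job ID, optionally followed by `;` and the cluster name. The function name `slurm_output_file` and the sample value `1234567;lumi` are hypothetical.

```bash
#!/bin/bash
# Hypothetical helper, not part of the repository.
# Converts the output of `sbatch --parsable` ("<jobid>" or "<jobid>;<cluster>")
# into the default Slurm output file name, slurm-<jobid>.out.
slurm_output_file() {
    local parsable="$1"
    local jobid="${parsable%%;*}"   # drop an optional ";<cluster>" suffix
    echo "slurm-${jobid}.out"
}

# Intended use on LUMI (commented out so the sketch runs anywhere):
#   out_file="$(slurm_output_file "$(sbatch --parsable run_base.sh)")"
#   grep "SMOKE TEST PASSED" "$out_file"   # once the job has finished

# Demonstration with a made-up job ID:
slurm_output_file "1234567;lumi"
```

The file only exists once the job has started, so the `grep` belongs after the job has left the queue.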
diff --git a/1-quickstart/run_vit.sh b/1-quickstart/run_vit.sh
@@ -0,0 +1,23 @@
+#!/bin/bash
+
+#SBATCH --job-name=quickstart-vit
+#SBATCH --account=project_462000131
+#SBATCH --partition=small-g
+
+#SBATCH --nodes=1
+#SBATCH --gpus-per-node=1
+#SBATCH --ntasks-per-node=1
+#SBATCH --cpus-per-task=7
+#SBATCH --mem-per-gpu=60G
+
+#SBATCH --time=01:00:00
+
+set -euo pipefail
+
+module use /appl/local/containers/ai-modules
+module load singularity-AI-bindings
+
+source ../env.sh
+
+: "${CONTAINER:?Set CONTAINER in env.sh}"
+singularity exec -B ../resources/visiontransformer-env.sqsh:/user-software:image-src=/ "$CONTAINER" /user-software/bin/python visiontransformer.py
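Review note (not part of the patch): if Step 2 was skipped, the bind mount in `run_vit.sh` points at a file that does not exist, and the resulting container error may be harder to read than a direct message. The sketch below is one way to fail early. The function name `sqsh_bind_spec` is hypothetical, while the bind format `<image>:/user-software:image-src=/` is copied from the script above.

```bash
#!/bin/bash
# Hypothetical helper, not part of the repository.
# Prints the bind specification used by run_vit.sh for a squashfs image,
# or fails with a hint when the image has not been built yet.
sqsh_bind_spec() {
    local sqsh="$1" mount_point="${2:-/user-software}"
    if [ ! -f "$sqsh" ]; then
        echo "Missing $sqsh: run ./build_visiontransformer_sqsh.sh first" >&2
        return 1
    fi
    echo "${sqsh}:${mount_point}:image-src=/"
}

# Intended use in a job script (commented out so the sketch runs anywhere):
#   bind="$(sqsh_bind_spec ../resources/visiontransformer-env.sqsh)"
#   singularity exec -B "$bind" "$CONTAINER" /user-software/bin/python visiontransformer.py
```

Assigning the result to `bind` on its own line matters under `set -e`: a failing command substitution placed directly inside the `singularity exec` arguments would not stop the script.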