diff --git a/02_Using_the_LUMI_web_interface/Clone_with_JupyterLab.md b/02_Using_the_LUMI_web_interface/Clone_with_JupyterLab.md
deleted file mode 100644
index 50af268..0000000
--- a/02_Using_the_LUMI_web_interface/Clone_with_JupyterLab.md
+++ /dev/null
@@ -1,20 +0,0 @@
-# Cloning the course git repository using JupyterLab UI
-
-1. Open a JupyterLab session using the Jupyter app on the LUMI web interface [www.lumi.csc.fi](https://www.lumi.csc.fi)
-
- Follow the instructions in the second part of the exercise for this session. You can then keep using the session
- for the rest of the exercise.
-
-2. Once you have opened JupyterLab and opened your own folder in the navigation panel to the left, your browser should present a view like this (in this case for user `lukaspre`):
-
- ![After starting JupyterLab and opening your own folder, the navigation panel shows an empty list and the main screen a selection of apps to use in JupyterLab.](images/step0.png)
-
-4. Use the highlighted button to open the UI popup for cloning a git repository:
-
- ![The button for cloning a git repository is in the top-left corner, just above the file search input.](images/step1.png)
-
-5. Enter the repository URL ( [https://github.com/Lumi-supercomputer/Getting_Started_with_AI_workshop](https://github.com/Lumi-supercomputer/Getting_Started_with_AI_workshop) ) and press the "Clone" button.
-
- ![The repository URL should be entered in the opening popup.](images/step2.png)
-
- This will clone the respository in a new folder "Getting_Started_with_AI_workshop" in your directory on the course project scratch filesystem.
diff --git a/02_Using_the_LUMI_web_interface/GPT-neo-IMDB-introduction.ipynb b/02_Using_the_LUMI_web_interface/GPT-neo-IMDB-introduction.ipynb
index 229c9c0..eb7e40c 100644
--- a/02_Using_the_LUMI_web_interface/GPT-neo-IMDB-introduction.ipynb
+++ b/02_Using_the_LUMI_web_interface/GPT-neo-IMDB-introduction.ipynb
@@ -39,7 +39,7 @@
 "outputs": [],
 "source": [
 "import os\n",
- "os.environ[\"HF_HOME\"] = \"/flash/project_465002178/hf-cache\""
+ "os.environ[\"HF_HOME\"] = \"/flash/project_465002757/hf-cache\""
 ]
 },
 {
diff --git a/02_Using_the_LUMI_web_interface/README.md b/02_Using_the_LUMI_web_interface/README.md
index a9bec70..0d08b22 100644
--- a/02_Using_the_LUMI_web_interface/README.md
+++ b/02_Using_the_LUMI_web_interface/README.md
@@ -7,12 +7,10 @@
 In this exercise you will gain first experience with using the LUMI web interface to navigate files and directories on the LUMI supercomputer. You will also set up your own copy of the exercise repository on the system, so that you can work on them without interfering with the other course participants.
 1. Log in to the LUMI web interface: https://www.lumi.csc.fi
- 2. Create your own subdirectory in `/project/project_465002178/` and `/scratch/project_465002178/`. Use your username for the directory name. You can either
+ 2. Create your own subdirectory in `/project/project_465002757/` and `/scratch/project_465002757/`. Use your username for the directory name. You can either
 - Use the built-in file explorer ("Home Directory"), or
 - Use the login node shell app in the webinterface
- 3. Clone the [exercise repository](https://github.com/Lumi-supercomputer/Getting_Started_with_AI_workshop) to your folder in `/project/project_465002178/`. You can either
- - use the login node shell app in the webinterface, or
- - start a Jupyter lab job and use the Jupyter lab UI for cloning Git repositories, see [Clone_with_JupyterLab.md](./Clone_with_JupyterLab.md) for an illustrated step-by-step guide for this.
+ 3. Clone the [exercise repository](https://github.com/Lumi-supercomputer/Getting_Started_with_AI_workshop) to your folder in `/project/project_465002757/`. You can use the login node shell app in the webinterface for that.
 4. Get familiar with the exercise repository layout.
 2. Start an interactive Jupyter lab job and run inference with GPT-neo.
@@ -20,15 +18,19 @@
 In this exercise you will learn how to reserve resources for and start an interactive job to run a Jupyter notebook via the LUMI web interface. The notebook itself introduces you to our running example of finetuning a language model using PyTorch and the training libraries provided by Huggingface. In this exercise you will not do any training, but familiarise yourself a bit with the software and the base model.
 1. Start an interactive Jupyter session: Open the Jupyter app (! not "Jupyter for Courses" !) in the LUMI webinterface and set the following settings before pressing `Launch`
- - Project: `project_465002178 (LUST Training ...)`
+ - Project: `project_465002757 (LUST Training ...)`
 - Reservation: Use the course reservation `AI_workshop_Day1` (there should only be one available option)
 - Partition: `small-g`
 - Number of CPU cores: `7`
 - Memory (GB): `16`
 - Time: `0:30:00`
 - Working directory: `/project/$PROJECT`
- - Python: `pytorch (Via CSC stack, limited support available)`
- - Virtual environment path: leave empty
+ - Press Advanced
+ - Custom Python Type: `Container`
+ - Modules to load: `Local-LAIF lumi-aif-singularity-bindings`
+ - Path to container with Python: `/appl/local/laifs/containers/lumi-multitorch-u24r64f21m43t29-20260319_153422/lumi-multitorch-full-u24r64f21m43t29-20260319_153422.sif`
+ - Container arguments: leave empty
+ - Init script for container: leave empty
 2. Wait for the session to start, then press `Connect to Jupyter`
 > **Note**
diff --git a/02_Using_the_LUMI_web_interface/images/step0.png b/02_Using_the_LUMI_web_interface/images/step0.png
deleted file mode 100644
index ac4c4ff..0000000
Binary files a/02_Using_the_LUMI_web_interface/images/step0.png and /dev/null differ
diff --git a/02_Using_the_LUMI_web_interface/images/step1.png b/02_Using_the_LUMI_web_interface/images/step1.png
deleted file mode 100644
index 58de26c..0000000
Binary files a/02_Using_the_LUMI_web_interface/images/step1.png and /dev/null differ
diff --git a/02_Using_the_LUMI_web_interface/images/step2.png b/02_Using_the_LUMI_web_interface/images/step2.png
deleted file mode 100644
index 146ab64..0000000
Binary files a/02_Using_the_LUMI_web_interface/images/step2.png and /dev/null differ
diff --git a/03_Your_first_AI_training_job_on_LUMI/reference_solution/resume_from_checkpoint/run.sh b/03_Your_first_AI_training_job_on_LUMI/reference_solution/resume_from_checkpoint/run.sh
index b23c7b8..529b0d1 100644
--- a/03_Your_first_AI_training_job_on_LUMI/reference_solution/resume_from_checkpoint/run.sh
+++ b/03_Your_first_AI_training_job_on_LUMI/reference_solution/resume_from_checkpoint/run.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-#SBATCH --account=project_465002178
+#SBATCH --account=project_465002757
 #SBATCH --reservation=AI_workshop_Day1 # comment this out if the reservation is no longer available
 #SBATCH --partition=small-g
 #SBATCH --gpus-per-node=1
@@ -10,14 +10,14 @@
 # Set up the software environment
 # NOTE: the loaded module makes relevant filesystem locations available inside the singularity container
-# (/scratch, /project, etc) as well as mounts some important system libraries that are optimized for LUMI
+# (/scratch, /project, etc)
 # If you are interested, you can check the exact paths being mounted from
-# /appl/local/containers/ai-modules/singularity-AI-bindings/24.03.lua
+# /appl/local/laifs/modules/lumi-aif-singularity-bindings/1.0.0.lua
 module purge
-module use /appl/local/containers/ai-modules
-module load singularity-AI-bindings
+module use /appl/local/laifs/modules
+module load lumi-aif-singularity-bindings
-CONTAINER=/appl/local/containers/sif-images/lumi-pytorch-rocm-6.2.4-python-3.12-pytorch-v2.6.0.sif
+CONTAINER=/appl/local/laifs/containers/lumi-multitorch-u24r64f21m43t29-20260319_153422/lumi-multitorch-full-u24r64f21m43t29-20260319_153422.sif
 # Some environment variables to set up cache directories
 SCRATCH="/scratch/${SLURM_JOB_ACCOUNT}"
@@ -35,7 +35,7 @@ export OUTPUT_DIR=$SCRATCH/$USER/data/
 export LOGGING_DIR=$SCRATCH/$USER/runs/
 set -xv # print the command so that we can verify setting arguments correctly from the logs
-srun singularity exec $CONTAINER \
+srun singularity run $CONTAINER \
 python GPT-neo-IMDB-finetuning.py \
 --model-name gpt-imdb-model \
 --output-path $OUTPUT_DIR \
diff --git a/03_Your_first_AI_training_job_on_LUMI/reference_solution/run.sh b/03_Your_first_AI_training_job_on_LUMI/reference_solution/run.sh
index 92c5a13..e0530f4 100644
--- a/03_Your_first_AI_training_job_on_LUMI/reference_solution/run.sh
+++ b/03_Your_first_AI_training_job_on_LUMI/reference_solution/run.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-#SBATCH --account=project_465002178
+#SBATCH --account=project_465002757
 #SBATCH --reservation=AI_workshop_Day1 # comment this out if the reservation is no longer available
 #SBATCH --partition=small-g
 #SBATCH --gpus-per-node=1
@@ -10,14 +10,14 @@
 # Set up the software environment
 # NOTE: the loaded module makes relevant filesystem locations available inside the singularity container
-# (/scratch, /project, etc) as well as mounts some important system libraries that are optimized for LUMI
+# (/scratch, /project, etc)
 # If you are interested, you can check the exact paths being mounted from
-# /appl/local/containers/ai-modules/singularity-AI-bindings/24.03.lua
+# /appl/local/laifs/modules/lumi-aif-singularity-bindings/1.0.0.lua
 module purge
-module use /appl/local/containers/ai-modules
-module load singularity-AI-bindings
+module use /appl/local/laifs/modules
+module load lumi-aif-singularity-bindings
-CONTAINER=/appl/local/containers/sif-images/lumi-pytorch-rocm-6.2.4-python-3.12-pytorch-v2.6.0.sif
+CONTAINER=/appl/local/laifs/containers/lumi-multitorch-u24r64f21m43t29-20260319_153422/lumi-multitorch-full-u24r64f21m43t29-20260319_153422.sif
 # Some environment variables to set up cache directories
 SCRATCH="/scratch/${SLURM_JOB_ACCOUNT}"
@@ -35,7 +35,7 @@ export OUTPUT_DIR=$SCRATCH/$USER/data/
 export LOGGING_DIR=$SCRATCH/$USER/runs/
 set -xv # print the command so that we can verify setting arguments correctly from the logs
-srun singularity exec $CONTAINER \
+srun singularity run $CONTAINER \
 python GPT-neo-IMDB-finetuning.py \
 --model-name gpt-imdb-model \
 --output-path $OUTPUT_DIR \
diff --git a/03_Your_first_AI_training_job_on_LUMI/run.sh b/03_Your_first_AI_training_job_on_LUMI/run.sh
index fab3d08..a79c78f 100644
--- a/03_Your_first_AI_training_job_on_LUMI/run.sh
+++ b/03_Your_first_AI_training_job_on_LUMI/run.sh
@@ -1,19 +1,19 @@
 #!/bin/bash
-#SBATCH --account=project_465002178
+#SBATCH --account=project_465002757
 #SBATCH --reservation=AI_workshop_Day1 # comment this out if the reservation is no longer available
 #SBATCH --partition=...
 ##
 # Set up the software environment
 # NOTE: the loaded module makes relevant filesystem locations available inside the singularity container
-# (/scratch, /project, etc) as well as mounts some important system libraries that are optimized for LUMI
+# (/scratch, /project, etc)
 # If you are interested, you can check the exact paths being mounted from
-# /appl/local/containers/ai-modules/singularity-AI-bindings/24.03.lua
+# /appl/local/laifs/modules/lumi-aif-singularity-bindings/1.0.0.lua
 module purge
-module use /appl/local/containers/ai-modules
-module load singularity-AI-bindings
+module use /appl/local/laifs/modules
+module load lumi-aif-singularity-bindings
-CONTAINER=/appl/local/containers/sif-images/lumi-pytorch-rocm-6.2.4-python-3.12-pytorch-v2.6.0.sif
+CONTAINER=/appl/local/laifs/containers/lumi-multitorch-u24r64f21m43t29-20260319_153422/lumi-multitorch-full-u24r64f21m43t29-20260319_153422.sif
 # Some environment variables to set up cache directories
 SCRATCH="/scratch/${SLURM_JOB_ACCOUNT}"
diff --git a/08_Scaling_to_multiple_GPUs/reference_solution/prints_only_from_single_process/run_no_torchrun.sh b/08_Scaling_to_multiple_GPUs/reference_solution/prints_only_from_single_process/run_no_torchrun.sh
index 67581e1..d566eda 100644
--- a/08_Scaling_to_multiple_GPUs/reference_solution/prints_only_from_single_process/run_no_torchrun.sh
+++ b/08_Scaling_to_multiple_GPUs/reference_solution/prints_only_from_single_process/run_no_torchrun.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-#SBATCH --account=project_465002178
+#SBATCH --account=project_465002757
 #SBATCH --reservation=AI_workshop_Day2 # comment this out if the reservation is no longer available
 #SBATCH --partition=standard-g
 #SBATCH --nodes=1
@@ -11,14 +11,14 @@
 # Set up the software environment
 # NOTE: the loaded module makes relevant filesystem locations available inside the singularity container
-# (/scratch, /project, etc) as well as mounts some important system libraries that are optimized for LUMI
+# (/scratch, /project, etc)
 # If you are interested, you can check the exact paths being mounted from
-# /appl/local/containers/ai-modules/singularity-AI-bindings/24.03.lua
+# /appl/local/laifs/modules/lumi-aif-singularity-bindings/1.0.0.lua
 module purge
-module use /appl/local/containers/ai-modules
-module load singularity-AI-bindings
+module use /appl/local/laifs/modules
+module load lumi-aif-singularity-bindings
-CONTAINER=/appl/local/containers/sif-images/lumi-pytorch-rocm-6.2.4-python-3.12-pytorch-v2.6.0.sif
+CONTAINER=/appl/local/laifs/containers/lumi-multitorch-u24r64f21m43t29-20260319_153422/lumi-multitorch-full-u24r64f21m43t29-20260319_153422.sif
 # Some environment variables to set up cache directories
 SCRATCH="/scratch/${SLURM_JOB_ACCOUNT}"
@@ -46,7 +46,7 @@ export LOCAL_WORLD_SIZE=$SLURM_GPUS_PER_NODE
 # As opposed to the example in `run_torchrun.sh`, we can set the CPU binds directly via the slurm command, since we have
 # one task per GPU. In this case we do NOT need to set them from within the Python code itself.
-srun singularity exec $CONTAINER \
+srun singularity run $CONTAINER \
 bash -c "RANK=\$SLURM_PROCID \
 LOCAL_RANK=\$SLURM_LOCALID \
 python GPT-neo-IMDB-finetuning.py \
diff --git a/08_Scaling_to_multiple_GPUs/reference_solution/prints_only_from_single_process/run_torchrun.sh b/08_Scaling_to_multiple_GPUs/reference_solution/prints_only_from_single_process/run_torchrun.sh
index 5bc4378..ddeb771 100644
--- a/08_Scaling_to_multiple_GPUs/reference_solution/prints_only_from_single_process/run_torchrun.sh
+++ b/08_Scaling_to_multiple_GPUs/reference_solution/prints_only_from_single_process/run_torchrun.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-#SBATCH --account=project_465002178
+#SBATCH --account=project_465002757
 #SBATCH --reservation=AI_workshop_Day2 # comment this out if the reservation is no longer available
 #SBATCH --partition=standard-g
 #SBATCH --nodes=1
@@ -11,14 +11,14 @@
 # Set up the software environment
 # NOTE: the loaded module makes relevant filesystem locations available inside the singularity container
-# (/scratch, /project, etc) as well as mounts some important system libraries that are optimized for LUMI
+# (/scratch, /project, etc)
 # If you are interested, you can check the exact paths being mounted from
-# /appl/local/containers/ai-modules/singularity-AI-bindings/24.03.lua
+# /appl/local/laifs/modules/lumi-aif-singularity-bindings/1.0.0.lua
 module purge
-module use /appl/local/containers/ai-modules
-module load singularity-AI-bindings
+module use /appl/local/laifs/modules
+module load lumi-aif-singularity-bindings
-CONTAINER=/appl/local/containers/sif-images/lumi-pytorch-rocm-6.2.4-python-3.12-pytorch-v2.6.0.sif
+CONTAINER=/appl/local/laifs/containers/lumi-multitorch-u24r64f21m43t29-20260319_153422/lumi-multitorch-full-u24r64f21m43t29-20260319_153422.sif
 # Some environment variables to set up cache directories
 SCRATCH="/scratch/${SLURM_JOB_ACCOUNT}"
@@ -41,7 +41,7 @@ set -xv # print the command so that we can verify setting arguments correctly fr
 # Since we start only one task with slurm which then starts subprocesses, we cannot use slurm to configure CPU binds.
 # Therefore we need to set them up in the Python code itself.
-srun singularity exec $CONTAINER \
+srun singularity run $CONTAINER \
 torchrun --standalone \
 --nnodes=1 \
 --nproc-per-node=${SLURM_GPUS_PER_NODE} \
diff --git a/08_Scaling_to_multiple_GPUs/reference_solution/run_no_torchrun.sh b/08_Scaling_to_multiple_GPUs/reference_solution/run_no_torchrun.sh
index 67581e1..d566eda 100644
--- a/08_Scaling_to_multiple_GPUs/reference_solution/run_no_torchrun.sh
+++ b/08_Scaling_to_multiple_GPUs/reference_solution/run_no_torchrun.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-#SBATCH --account=project_465002178
+#SBATCH --account=project_465002757
 #SBATCH --reservation=AI_workshop_Day2 # comment this out if the reservation is no longer available
 #SBATCH --partition=standard-g
 #SBATCH --nodes=1
@@ -11,14 +11,14 @@
 # Set up the software environment
 # NOTE: the loaded module makes relevant filesystem locations available inside the singularity container
-# (/scratch, /project, etc) as well as mounts some important system libraries that are optimized for LUMI
+# (/scratch, /project, etc)
 # If you are interested, you can check the exact paths being mounted from
-# /appl/local/containers/ai-modules/singularity-AI-bindings/24.03.lua
+# /appl/local/laifs/modules/lumi-aif-singularity-bindings/1.0.0.lua
 module purge
-module use /appl/local/containers/ai-modules
-module load singularity-AI-bindings
+module use /appl/local/laifs/modules
+module load lumi-aif-singularity-bindings
-CONTAINER=/appl/local/containers/sif-images/lumi-pytorch-rocm-6.2.4-python-3.12-pytorch-v2.6.0.sif
+CONTAINER=/appl/local/laifs/containers/lumi-multitorch-u24r64f21m43t29-20260319_153422/lumi-multitorch-full-u24r64f21m43t29-20260319_153422.sif
 # Some environment variables to set up cache directories
 SCRATCH="/scratch/${SLURM_JOB_ACCOUNT}"
@@ -46,7 +46,7 @@ export LOCAL_WORLD_SIZE=$SLURM_GPUS_PER_NODE
 # As opposed to the example in `run_torchrun.sh`, we can set the CPU binds directly via the slurm command, since we have
 # one task per GPU. In this case we do NOT need to set them from within the Python code itself.
-srun singularity exec $CONTAINER \
+srun singularity run $CONTAINER \
 bash -c "RANK=\$SLURM_PROCID \
 LOCAL_RANK=\$SLURM_LOCALID \
 python GPT-neo-IMDB-finetuning.py \
diff --git a/08_Scaling_to_multiple_GPUs/reference_solution/run_torchrun.sh b/08_Scaling_to_multiple_GPUs/reference_solution/run_torchrun.sh
index 5bc4378..ddeb771 100644
--- a/08_Scaling_to_multiple_GPUs/reference_solution/run_torchrun.sh
+++ b/08_Scaling_to_multiple_GPUs/reference_solution/run_torchrun.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-#SBATCH --account=project_465002178
+#SBATCH --account=project_465002757
 #SBATCH --reservation=AI_workshop_Day2 # comment this out if the reservation is no longer available
 #SBATCH --partition=standard-g
 #SBATCH --nodes=1
@@ -11,14 +11,14 @@
 # Set up the software environment
 # NOTE: the loaded module makes relevant filesystem locations available inside the singularity container
-# (/scratch, /project, etc) as well as mounts some important system libraries that are optimized for LUMI
+# (/scratch, /project, etc)
 # If you are interested, you can check the exact paths being mounted from
-# /appl/local/containers/ai-modules/singularity-AI-bindings/24.03.lua
+# /appl/local/laifs/modules/lumi-aif-singularity-bindings/1.0.0.lua
 module purge
-module use /appl/local/containers/ai-modules
-module load singularity-AI-bindings
+module use /appl/local/laifs/modules
+module load lumi-aif-singularity-bindings
-CONTAINER=/appl/local/containers/sif-images/lumi-pytorch-rocm-6.2.4-python-3.12-pytorch-v2.6.0.sif
+CONTAINER=/appl/local/laifs/containers/lumi-multitorch-u24r64f21m43t29-20260319_153422/lumi-multitorch-full-u24r64f21m43t29-20260319_153422.sif
 # Some environment variables to set up cache directories
 SCRATCH="/scratch/${SLURM_JOB_ACCOUNT}"
@@ -41,7 +41,7 @@ set -xv # print the command so that we can verify setting arguments correctly fr
 # Since we start only one task with slurm which then starts subprocesses, we cannot use slurm to configure CPU binds.
 # Therefore we need to set them up in the Python code itself.
-srun singularity exec $CONTAINER \
+srun singularity run $CONTAINER \
 torchrun --standalone \
 --nnodes=1 \
 --nproc-per-node=${SLURM_GPUS_PER_NODE} \
diff --git a/08_Scaling_to_multiple_GPUs/reference_solution/with_cpu_bindings/run_no_torchrun.sh b/08_Scaling_to_multiple_GPUs/reference_solution/with_cpu_bindings/run_no_torchrun.sh
index 4b3e62a..8cc11ed 100644
--- a/08_Scaling_to_multiple_GPUs/reference_solution/with_cpu_bindings/run_no_torchrun.sh
+++ b/08_Scaling_to_multiple_GPUs/reference_solution/with_cpu_bindings/run_no_torchrun.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-#SBATCH --account=project_465002178
+#SBATCH --account=project_465002757
 #SBATCH --reservation=AI_workshop_Day2 # comment this out if the reservation is no longer available
 #SBATCH --partition=standard-g
 #SBATCH --nodes=1
@@ -11,14 +11,14 @@
 # Set up the software environment
 # NOTE: the loaded module makes relevant filesystem locations available inside the singularity container
-# (/scratch, /project, etc) as well as mounts some important system libraries that are optimized for LUMI
+# (/scratch, /project, etc)
 # If you are interested, you can check the exact paths being mounted from
-# /appl/local/containers/ai-modules/singularity-AI-bindings/24.03.lua
+# /appl/local/laifs/modules/lumi-aif-singularity-bindings/1.0.0.lua
 module purge
-module use /appl/local/containers/ai-modules
-module load singularity-AI-bindings
+module use /appl/local/laifs/modules
+module load lumi-aif-singularity-bindings
-CONTAINER=/appl/local/containers/sif-images/lumi-pytorch-rocm-6.2.4-python-3.12-pytorch-v2.6.0.sif
+CONTAINER=/appl/local/laifs/containers/lumi-multitorch-u24r64f21m43t29-20260319_153422/lumi-multitorch-full-u24r64f21m43t29-20260319_153422.sif
 # Some environment variables to set up cache directories
 SCRATCH="/scratch/${SLURM_JOB_ACCOUNT}"
@@ -52,7 +52,7 @@ CPU_BIND_MASKS="0x00fe000000000000,0xfe00000000000000,0x0000000000fe0000,0x00000
 # tell slurm to configure the cpu binds specified by the mask, additional option v prints to configuration to the logs
 srun --cpu-bind=v,mask_cpu=$CPU_BIND_MASKS \
- singularity exec $CONTAINER \
+ singularity run $CONTAINER \
 bash -c "RANK=\$SLURM_PROCID \
 LOCAL_RANK=\$SLURM_LOCALID \
 python GPT-neo-IMDB-finetuning.py \
diff --git a/08_Scaling_to_multiple_GPUs/reference_solution/with_cpu_bindings/run_torchrun.sh b/08_Scaling_to_multiple_GPUs/reference_solution/with_cpu_bindings/run_torchrun.sh
index 8bc4509..a4a067b 100644
--- a/08_Scaling_to_multiple_GPUs/reference_solution/with_cpu_bindings/run_torchrun.sh
+++ b/08_Scaling_to_multiple_GPUs/reference_solution/with_cpu_bindings/run_torchrun.sh
@@ -1,5 +1,5 @@
 #!/bin/bash
-#SBATCH --account=project_465002178
+#SBATCH --account=project_465002757
 #SBATCH --reservation=AI_workshop_Day2 # comment this out if the reservation is no longer available
 #SBATCH --partition=standard-g
 #SBATCH --nodes=1
@@ -11,14 +11,14 @@
 # Set up the software environment
 # NOTE: the loaded module makes relevant filesystem locations available inside the singularity container
-# (/scratch, /project, etc) as well as mounts some important system libraries that are optimized for LUMI
+# (/scratch, /project, etc)
 # If you are interested, you can check the exact paths being mounted from
-# /appl/local/containers/ai-modules/singularity-AI-bindings/24.03.lua
+# /appl/local/laifs/modules/lumi-aif-singularity-bindings/1.0.0.lua
 module purge
-module use /appl/local/containers/ai-modules
-module load singularity-AI-bindings
+module use /appl/local/laifs/modules
+module load lumi-aif-singularity-bindings
-CONTAINER=/appl/local/containers/sif-images/lumi-pytorch-rocm-6.2.4-python-3.12-pytorch-v2.6.0.sif
+CONTAINER=/appl/local/laifs/containers/lumi-multitorch-u24r64f21m43t29-20260319_153422/lumi-multitorch-full-u24r64f21m43t29-20260319_153422.sif
 # Some environment variables to set up cache directories
 SCRATCH="/scratch/${SLURM_JOB_ACCOUNT}"
@@ -41,7 +41,7 @@ set -xv # print the command so that we can verify setting arguments correctly fr
 # Since we start only one task with slurm which then starts subprocesses, we cannot use slurm to configure CPU binds.
 # Therefore we need to set them up in the Python code itself.
-srun singularity exec $CONTAINER \
+srun singularity run $CONTAINER \
 torchrun --standalone \
 --nnodes=1 \
 --nproc-per-node=${SLURM_GPUS_PER_NODE} \
diff --git a/08_Scaling_to_multiple_GPUs/run.sh b/08_Scaling_to_multiple_GPUs/run.sh
index 050d7c7..153d321 100644
--- a/08_Scaling_to_multiple_GPUs/run.sh
+++ b/08_Scaling_to_multiple_GPUs/run.sh
@@ -1,19 +1,19 @@
 #!/bin/bash
-#SBATCH --account=project_465002178
+#SBATCH --account=project_465002757
 #SBATCH --reservation=AI_workshop_Day2 # comment this out if the reservation is no longer available
 #SBATCH --partition=...
 ##
 # Set up the software environment
 # NOTE: the loaded module makes relevant filesystem locations available inside the singularity container
-# (/scratch, /project, etc) as well as mounts some important system libraries that are optimized for LUMI
+# (/scratch, /project, etc)
 # If you are interested, you can check the exact paths being mounted from
-# /appl/local/containers/ai-modules/singularity-AI-bindings/24.03.lua
+# /appl/local/laifs/modules/lumi-aif-singularity-bindings/1.0.0.lua
 module purge
-module use /appl/local/containers/ai-modules
-module load singularity-AI-bindings
+module use /appl/local/laifs/modules
+module load lumi-aif-singularity-bindings
-CONTAINER=/appl/local/containers/sif-images/lumi-pytorch-rocm-6.2.4-python-3.12-pytorch-v2.6.0.sif
+CONTAINER=/appl/local/laifs/containers/lumi-multitorch-u24r64f21m43t29-20260319_153422/lumi-multitorch-full-u24r64f21m43t29-20260319_153422.sif
 # Some environment variables to set up cache directories
 SCRATCH="/scratch/${SLURM_JOB_ACCOUNT}"
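
The patch repeats the same few substitutions (project account, module path, module name, container image location) across many scripts, so a leftover is easy to miss. A quick check for stale references could be sketched as below. This is a hypothetical helper, not part of the repository: the function name `find_stale_refs` is an assumption, and the patterns are only the old identifiers that appear in the removed lines of this diff. It assumes GNU `grep` (for `--exclude-dir`).

```shell
#!/bin/bash
# Hypothetical helper (not part of the patch): list files under a directory
# that still mention identifiers which this diff replaces.
# Assumption: GNU grep is available (needed for --exclude-dir).
find_stale_refs() {
    local root="${1:-.}"
    # -r: recurse, -l: print file names only, -E: extended regex.
    # grep exits non-zero when nothing matches; "|| true" keeps that from
    # being treated as a failure under "set -e" or "set -o pipefail".
    {
        grep -rlE --exclude-dir=.git \
            -e 'project_465002178' \
            -e 'singularity-AI-bindings' \
            -e '/appl/local/containers/(ai-modules|sif-images)' \
            "$root" || true
    } | sort
}

# Usage (run from the repository root); empty output means no leftovers:
#   find_stale_refs .
```

The new names do not trigger the patterns: `lumi-aif-singularity-bindings` does not contain `singularity-AI-bindings`, and `/appl/local/laifs/...` does not contain `/appl/local/containers/`.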