Update "Install Slurm" documentation to leverage cloud-init #82
lunamorrow wants to merge 23 commits into OpenCHAMI:main
…n will need some further updates to align better with the Tutorial (e.g. changing IP addresses, adjusting comments to support bare-metal and cloud setups, etc.) and to ensure the documented approach is sufficiently broad for general purpose. Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
… for creating some files from cat to copy-paste to prevent issues with bash command/variable processing Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
… - this should make this guide easy to follow on with after the tutorial Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…Next step will be expanding comments/explanations to provide more context to users, as well as providing more code blocks to show expected output of commands that produce output. Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…id. Changes include making it clearer when the pwgen password is used, correcting the file creation step for slurm.conf to prevent errors, removing instructions for aliasing the build command (and instead redirecting to the appropriate tutorial section), updating instructions in line with a recent PR to replace MinIO with Versity S3, and some minor typo fixes Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…ck from David. Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…Some reviews are still pending as I figure out the source of the problem and a solution, and I will address these in a later commit. Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
… to VM head nodes. Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…certain commands should behave and/or the output they should produce. Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…ecurity vulnerabilities with versions 0.5-0.5.17 Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…ompute node. Additionally made some tweaks to the documentation to make the workflow more robust after repeating it on a fresh node. Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…in a few places Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…erence to the 'Install Slurm' guide Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…t and the image config to reduce the number of commands needing to be run on the compute node. We are waiting on feedback from David and Alex before potentially implementing a more persistent Slurm configuration on the compute node/s. Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…evon Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
… in the working directory '/opt/workdir' (as desired) and not the user's home directory Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…r' in the slurm-local.repo file Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…f slurm RPMs in '/opt/workdir' Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…ommand Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
… explanation that the SlurmctldHost must be 'head' instead of 'demo' when the head node is a VM Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…rrow/cloud-init-compute-node-slurm-config Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
…t so that compute node Slurm configuration is persistent across nodes and on reboot Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
I have changed the documentation to use cloud-init instead of manually configuring the compute node. This process also sets up NFS to mount shared files (e.g. Slurm configuration files) used by both the compute node and the head node.

The current commit only adds a basic compute node configuration (similar to what was already there, only with cloud-init now), but I can push up a more complex configuration which sets up LDAP and mounts the compute node with more memory for a more "realistic" Slurm setup. That way anyone who follows the guide will finish with a more production-ready Slurm configuration. Let me know what you think @synackd @davidallendj @alexlovelltroy

The merge I performed on this branch pulled in quite a lot of old commits, which has clogged up this PR a bit. Sorry about that!
```bash
sudo systemctl start slurmdbd
sudo systemctl start slurmctld
```
I'm getting an error when trying to start slurmctld. I tried starting it with both SlurmctldHost=demo and SlurmctldHost=head, and both give me the same error.
[rocky@openchami-testing workdir]$ sudo systemctl start slurmdbd
sudo systemctl start slurmctld
Job for slurmctld.service failed because the control process exited with error code.
See "systemctl status slurmctld.service" and "journalctl -xeu slurmctld.service" for details.
That is odd. Is your head node a VM, or is it a bare-metal or cloud instance? Could you also please share the output of sudo systemctl status slurmctld and sudo journalctl -xeu slurmctld.service, so I can identify the root cause?
It's a cloud instance using JetStream 2. Here's what I have for journalctl -eu slurmctld.
Mar 17 19:38:35 openchami-testing.novalocal systemd[1]: Starting Slurm controller daemon...
Mar 17 19:38:35 openchami-testing.novalocal slurmctld[547181]: slurmctld: error: If using PrologFlags=Contain for pam_slurm_adopt, proctrack/cgroup is required. If not using pam_slurm_adopt, please ignore error.
Mar 17 19:38:35 openchami-testing.novalocal slurmctld[547181]: error: If using PrologFlags=Contain for pam_slurm_adopt, proctrack/cgroup is required. If not using pam_slurm_adopt, please ignore error.
Mar 17 19:38:35 openchami-testing.novalocal slurmctld[547181]: slurmctld: error: Configured MailProg is invalid
Mar 17 19:38:35 openchami-testing.novalocal slurmctld[547181]: slurmctld: slurmctld version 24.05.5 started on cluster demo
Mar 17 19:38:35 openchami-testing.novalocal slurmctld[547181]: slurmctld: error: This host (openchami-testing/openchami-testing.novalocal) not a valid controller
Mar 17 19:38:35 openchami-testing.novalocal systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
Mar 17 19:38:35 openchami-testing.novalocal systemd[1]: slurmctld.service: Failed with result 'exit-code'.
Mar 17 19:38:35 openchami-testing.novalocal systemd[1]: Failed to start Slurm controller daemon.
Mar 17 19:40:04 openchami-testing.novalocal systemd[1]: Starting Slurm controller daemon...
Mar 17 19:40:04 openchami-testing.novalocal slurmctld[547264]: slurmctld: error: If using PrologFlags=Contain for pam_slurm_adopt, proctrack/cgroup is required. If not using pam_slurm_adopt, please ignore error.
Mar 17 19:40:04 openchami-testing.novalocal slurmctld[547264]: error: If using PrologFlags=Contain for pam_slurm_adopt, proctrack/cgroup is required. If not using pam_slurm_adopt, please ignore error.
Mar 17 19:40:04 openchami-testing.novalocal slurmctld[547264]: slurmctld: error: Configured MailProg is invalid
Mar 17 19:40:04 openchami-testing.novalocal slurmctld[547264]: slurmctld: slurmctld version 24.05.5 started on cluster demo
Mar 17 19:40:04 openchami-testing.novalocal slurmctld[547264]: slurmctld: error: This host (openchami-testing/openchami-testing.novalocal) not a valid controller
Mar 17 19:40:04 openchami-testing.novalocal systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
Mar 17 19:40:04 openchami-testing.novalocal systemd[1]: slurmctld.service: Failed with result 'exit-code'.
Mar 17 19:40:04 openchami-testing.novalocal systemd[1]: Failed to start Slurm controller daemon.
Alright, it looks like your head node has a different hostname than the one defined in the tutorial. You'll need to either change the head node's hostname to demo.openchami.cluster or set SlurmctldHost in /etc/slurm/slurm.conf to your hostname, openchami-testing.
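For anyone hitting the same "not a valid controller" error, the second option can be sketched roughly as below. This is only a minimal sketch: get_slurmctld_host and set_slurmctld_host are hypothetical helper names invented for illustration (they are not part of Slurm or OpenCHAMI), and the sketch assumes GNU sed and a slurm.conf with a single SlurmctldHost line.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for illustration only; not part of Slurm or OpenCHAMI.

# Print the controller name configured in a slurm.conf-style file.
get_slurmctld_host() {
    local conf="$1"
    # Keep the value after "SlurmctldHost=", dropping any "(address)" suffix.
    sed -n 's/^SlurmctldHost=\([^(]*\).*/\1/p' "$conf" | head -n 1
}

# Rewrite the SlurmctldHost line in place (assumes GNU sed, as on Rocky Linux).
set_slurmctld_host() {
    local conf="$1" host="$2"
    sed -i "s/^SlurmctldHost=.*/SlurmctldHost=${host}/" "$conf"
}
```

On the head node, one would compare the output of get_slurmctld_host /etc/slurm/slurm.conf with hostname -s, call set_slurmctld_host as root if they differ, and then retry sudo systemctl start slurmctld.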
…hown' command Signed-off-by: Luna Morrow <luna.morrow2@gmail.com>
Pull Request Template
Thank you for your contribution! Please ensure the following before submitting:
Checklist
…make test (or equivalent) locally and all tests pass
…(git commit -s) with my real name and email
…<filename>.license sidecar
…LICENSES/ directory
Description
Updating/extending the "Install Slurm" documentation guide to leverage OpenCHAMI's cloud-init to make compute node configuration persistent across nodes and on reboot. See discussion/comments on PR #72.
Fixes #(issue)
Type of Change
For more info, see Contributing Guidelines.