Upgrade Slurm in an AWS ParallelCluster cluster
An AWS ParallelCluster release comes with a set of AMIs for the supported operating systems and EC2 platforms. Each AMI contains a software stack, including the Slurm package, that has been validated at ParallelCluster release time. If you wish to upgrade Slurm on your cluster, you can follow this guide.
WARNING: due to the integration between ParallelCluster and Slurm, you must keep Slurm within the same major version that was provided in the release of ParallelCluster used in the cluster (see the ParallelCluster public documentation for more information about the versions of Slurm used in the various releases of ParallelCluster). E.g., a cluster running ParallelCluster 3.7.0 must have a 23.02.x version of Slurm.
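To confirm this pairing before you start, you can check both versions. A minimal sketch, assuming the pcluster v3 CLI is installed and configured; <cluster_name> and <region> are placeholders:
$ pcluster describe-cluster -n <cluster_name> --region <region> | grep '"version"'  # ParallelCluster version of the cluster
$ sinfo -V  # Slurm version, run on the head node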
If you wish to upgrade Slurm on the head node of your cluster, you cannot rely on upgrading your AMI, as in ParallelCluster the AMI of the head node cannot be changed with a pcluster update-cluster operation. In this case, please follow these steps (here we are installing version 23.02.5).
- Stop the compute fleet on the cluster via a pcluster update-compute-fleet -n <cluster_name> --status STOP_REQUESTED operation, and wait for the compute fleet to be stopped (a polling sketch follows the version check below).
- Verify installed version of Slurm:
$ sinfo -V
slurm 23.02.4
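For the first step, one way to wait until the fleet actually reaches the stopped state is to poll pcluster describe-compute-fleet, whose JSON output includes the fleet status. A minimal sketch, with <cluster_name> as a placeholder and an arbitrary 10-second interval:
$ until pcluster describe-compute-fleet -n <cluster_name> | grep -q '"status": "STOPPED"'; do sleep 10; done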
- Verify which Slurm daemons are active on the head node and stop them (as root):
$ systemctl stop slurmrestd # Only if present
$ systemctl stop slurmctld
$ systemctl stop slurmdbd # Only if present
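Optionally, you can double-check that the daemons are no longer active before touching the installation. systemctl prints one state per unit; daemons not installed on your head node will simply not report active:
$ systemctl is-active slurmctld slurmdbd slurmrestd  # expect no "active" in the output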
- Back up the existing installation of Slurm.
WARNING: the following command includes the etc folder under the Slurm installation folder. Please mind this when re-extracting files from this tar onto /opt/slurm/. For instance, you can avoid overriding /opt/slurm/etc by using the --exclude flag of the tar utility.
$ cd /opt  # the Slurm installation folder lives under /opt
$ tar czf slurm_backup_"$(date +%Y%m%d-%H%M%S)".tar.gz slurm
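Should you later need to roll back from this backup while preserving the current configuration, a sketch of the inverse operation using the --exclude flag mentioned above (the timestamp is a placeholder; --exclude on extraction assumes GNU tar):
$ cd /opt
$ tar xzf slurm_backup_<timestamp>.tar.gz --exclude='slurm/etc'  # restore everything except the config folder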
- As root, recompile Slurm by executing the following script on the head node (the uninstall and rebuild of Slurm will not impact the etc folder under the installation folder, preserving the existing Slurm configuration).
WARNING: if you do not wish to recompile Slurm on the head node, you can create a new instance from the EC2 console using the "launch more like this" functionality and launch the compilation there with the script provided below. You can later copy the /opt/slurm/ folder (excluding the etc subdirectory) back to the head node.
#!/bin/bash
set -e
# Set desired version of Slurm to be installed on the cluster
SLURM_VERSION_NEW=slurm-23-02-5-1
# activate python virtual env
source /opt/parallelcluster/pyenv/versions/cookbook_virtualenv/bin/activate
# go to the Chef local cache folder
cd /etc/chef/local-mode-cache/cache/
# download new version of slurm
wget https://github.com/SchedMD/slurm/archive/${SLURM_VERSION_NEW}.tar.gz
# uninstall the current version from its original source tree
# (GitHub archives of SchedMD/slurm extract to a slurm-slurm-<version> folder)
cd slurm-slurm-*
make uninstall
cd -
rm -rf slurm-slurm-*
# unpack the new version (the archive extracts to slurm-${SLURM_VERSION_NEW})
tar xf ${SLURM_VERSION_NEW}.tar.gz
cd slurm-${SLURM_VERSION_NEW}
# compile and install the new version
./configure --prefix=/opt/slurm --with-pmix=/opt/pmix --with-jwt=/opt/libjwt --enable-slurmrestd
CORES=$(grep processor /proc/cpuinfo | wc -l)
make -j $CORES
make install
make install-contrib
# deactivate python virtual env
deactivate
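A possible way to run it, assuming you saved the script on the head node as upgrade-slurm.sh (the file name is illustrative; it must run as root, as noted above):
$ sudo bash ./upgrade-slurm.sh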
- Restart the Slurm daemons:
$ systemctl start slurmdbd # Only if present
$ systemctl start slurmctld
$ systemctl start slurmrestd # Only if present
- Verify installed version of Slurm:
$ sinfo -V
slurm 23.02.5
- Verify that all the required daemons are running with the new version of Slurm:
$ sudo grep "slurmctld version" /var/log/slurmctld.log | tail -n 1
[<timestamp>] slurmctld version 23.02.5 started on cluster <cluster-name>
$ sudo grep "slurmdbd version" /var/log/slurmdbd.log | tail -n 1
[<timestamp>] slurmdbd version 23.02.5 started
- Restart the compute fleet via a pcluster update-compute-fleet -n <cluster_name> --status START_REQUESTED operation.
The Slurm package will be available on the compute nodes through the /opt/slurm shared folder. It is not required to execute any action on them.
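As a final optional check once the fleet is running again, you can confirm that a compute node sees the upgraded package through the shared folder; a minimal sketch, assuming a queue with at least one node able to start:
$ srun -N 1 sinfo -V
slurm 23.02.5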