These Ansible playbooks create and manage a dynamically allocated Slurm cluster in an Openstack cloud. The cluster is based on Rocky Linux 9 and OpenHPC 3.x. The Slurm configuration is based on the work in https://github.com/XSEDE/CRI_Jetstream_Cluster.
To deploy and manage the cluster, you need:
- Access to an Openstack cloud such as Red Cloud at the Cornell University Center for Advanced Computing.
- An `openrc` file containing credentials for accessing the Openstack cloud.
- A computer with Python 3.6 or later installed.
To set up Ansible on your computer:
- Clone this repo to your computer.
- `cd` to the directory containing the repo.
- Run the `install_ansible.sh` script (see the example below).
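A minimal sketch of these steps; the repository URL and directory name are placeholders for wherever you clone this repo, and `install_ansible.sh` is assumed to create the `ansible/` Python virtual environment used below:

```
# Clone this repository (URL is a placeholder) and enter it
git clone <this-repo-url>
cd <repo-directory>
# Install Ansible into a local Python virtual environment
./install_ansible.sh
```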
To run the Ansible playbooks described in the following sections, you must first, in the same terminal:
- Activate the Ansible Python virtual environment with the `source ansible/bin/activate` command.
- Source the Openstack cloud credentials in the `openrc` file, then confirm access with the `openstack project list` command. It should return a list of the projects you can access in the Openstack cloud (see the combined example below).
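A minimal sketch of this preamble, assuming the credentials file is named `openrc` and lives in the current directory (adjust the path if yours differs):

```
# Activate the Ansible Python virtual environment created by install_ansible.sh
source ansible/bin/activate
# Load your Openstack credentials
source openrc
# Verify access: this should list the Openstack projects available to you
openstack project list
```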
The `vars/main.yml` file defines the variables that configure the cluster. The relevant ones are:
- Cluster
  - `cluster_name`: Use a name unique to your Openstack project. The playbooks identify the cloud resources used by the cluster by this string in the resource names. Do not change this variable after you set it initially. You need to update the default value.
- Head Node
  - `head_node_flavor`: instance flavor of the head node.
  - `head_node_disk_size_gb`: disk size of the head node in GB.
  - `install_intel_oneapi`: set to `true` to install the Intel oneAPI compilers and MPI.
- Access
  - `ssh_public_keyfile` and `ssh_private_keyfile`: full paths (no tilde) to a matching pair of ssh public and private keys for initial access to the cluster. You need to update the default values.
  - `cluster_network_ssh_access`: restrict ssh access to the cluster to this IP range, or enter `0.0.0.0/0` for no restrictions. Make sure this CIDR includes the IP address of your computer so it can deploy the cluster!
- Networking
  - `cluster_network_dns_servers`: enter the appropriate DNS servers for your Openstack cloud. The default values are good for CAC Red Cloud.
- Compute Imaging Instance: `create_compute_image.yml` uses this instance to create the compute node image. The playbook creates and deletes this instance as needed.
  - `compute_imaging_instance_flavor`: instance flavor of the compute imaging instance.
- Compute Node: If you change any of the parameters in this section after the cluster is deployed, re-run the `provision_head_node.yml` and `create_compute_image.yml` playbooks for the changes to take effect.
  - `compute_node_flavor` and `compute_node_cpus`: the flavor and CPU count of a compute node. The CPU count must match the flavor or `slurmd` might fail to start on the compute node (see the example after this list for listing available flavors).
  - `compute_node_disk_size_gb`: disk size of a compute node in GB.
  - `max_compute_nodes`: maximum number of compute nodes the cluster can have.
  - `slurm_suspend_time`: number of seconds Slurm waits before deleting an idle compute node instance.
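Flavor names and their CPU counts vary from cloud to cloud. Before filling in `head_node_flavor`, `compute_imaging_instance_flavor`, and `compute_node_flavor`, you can check what your Openstack project offers (this assumes your credentials are already sourced as described above):

```
# List the instance flavors (name, RAM, disk, vCPUs) available to your project
openstack flavor list
# List existing images, e.g. to confirm a compute node image is not already present
openstack image list
```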
The Ansible playbooks are idempotent: after correcting an error or updating a variable, you can re-run them as many times as needed. To deploy the cluster:
- Create the head node by running the `create_head_node.yml` playbook: `ansible-playbook create_head_node.yml`.
- Provision the head node by running the `provision_head_node.yml` playbook: `ansible-playbook provision_head_node.yml`. This playbook can take a while to run, depending on your head node flavor, because it installs the user software packages (gcc, openmpi, etc.) that come with the OpenHPC distribution.
- Build the compute node image by running the `create_compute_image.yml` playbook: `ansible-playbook create_compute_image.yml`. This playbook checks that the `compute_node_image` image does not already exist in the cloud and will fail if it does; if needed, delete the pre-existing image with the `openstack image delete <compute_node_image>` command. The full deployment sequence is shown after this list.
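A sketch of the full deployment sequence, assuming the Ansible virtual environment is active and your Openstack credentials are sourced:

```
# 1. Create the head node instance
ansible-playbook create_head_node.yml
# 2. Provision the head node (installs the OpenHPC user software; this takes a while)
ansible-playbook provision_head_node.yml
# 3. If a compute node image is left over from a previous build, delete it first;
#    <compute_node_image> stands for the compute node image name used by the playbooks
# openstack image delete <compute_node_image>
# 4. Build the compute node image
ansible-playbook create_compute_image.yml
```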
You can gain initial access to the cluster by ssh'ing to the head node's IP address as the user defined by the `image_init_user` variable in `vars/main.yml`. When you submit jobs to Slurm using the `sbatch` or `srun` commands on the head node, Slurm will create and delete compute node instances as defined by the `max_compute_nodes` and `slurm_suspend_time` variables.
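For example, an initial login followed by a minimal test job. Here `<image_init_user>`, `<head-node-ip>`, and the key path are placeholders; substitute your `image_init_user` value, the head node's IP address, and your `ssh_private_keyfile`:

```
# Log in to the head node with the key pair configured in vars/main.yml
ssh -i /path/to/private_key <image_init_user>@<head-node-ip>

# On the head node: a trivial job that makes Slurm create a compute node instance,
# run the command, and delete the instance after slurm_suspend_time seconds of idling
srun -N 1 hostname
```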
When you are done with the cluster, delete all cloud resources used by the cluster by running the `destroy_cluster.yml` playbook: `ansible-playbook destroy_cluster.yml`. Note: all data stored in the cluster will be lost forever and cannot be recovered.
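A teardown sketch; the `openstack server list` check at the end is optional and simply confirms that no instances whose names contain your `cluster_name` remain:

```
# Delete the cluster and all of its cloud resources (data cannot be recovered)
ansible-playbook destroy_cluster.yml
# Optional: confirm that no cluster instances remain
openstack server list
```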