
(3.0.0 - 3.2.1) Running nodes might be mistakenly replaced when new jobs are scheduled


The issue

Due to an issue in the Slurm scheduler module responsible for initiating the provisioning of dynamic cloud nodes (Slurm bug 14073), running nodes might be mistakenly replaced with new instances. As a consequence, any jobs running on the affected nodes are re-queued and the cluster undergoes a temporary over-scaling that is automatically recovered.

In particular, the bug is triggered when a job is allocated to non-contiguous nodes (for example, node-1 and node-5) that need to be resumed. Normally Slurm resumes only the powered-down nodes allocated to the job; due to the bug, however, Slurm resumes all nodes within the spanned range, including nodes that are already running and that might be running other jobs. For example, if a job is allocated to node-1 and node-5, nodes node-[2-4] will also be resumed (see the sketch after the list below). This leads to three potential scenarios based on the status of the nodes that are mistakenly resumed:

  1. If the nodes are powered down, they will be launched but will remain idle for the configured ScaledownIdletime. This situation is unlikely to occur, since Slurm defaults to allocating contiguous nodes when all of them are powered down. However, insufficient capacity errors (ICE) when resuming nodes can fragment the node space and induce this condition. The fast insufficient capacity fail-over feature shipped with ParallelCluster 3.2.0 mitigates the fragmentation caused by ICE.
  2. If the nodes are idle, then the cluster undergoes a temporary over-scaling that is recovered within 1 minute.
  3. If the nodes are allocated and running a job, then that job will fail and be re-queued.
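
To make the faulty behavior concrete, here is a minimal illustrative sketch in Python. It is not actual Slurm or ParallelCluster code; the node names and the expand_range helper are hypothetical. It contrasts the resume set Slurm should compute with the set produced when the buggy scheduler expands the full node range:

```python
def expand_range(first: int, last: int, prefix: str = "node-") -> list[str]:
    """Expand a contiguous range such as node-[1-5] into individual node names."""
    return [f"{prefix}{i}" for i in range(first, last + 1)]

# A job is allocated two non-contiguous nodes, both currently powered down.
allocated = {"node-1", "node-5"}
powered_down = {"node-1", "node-5"}

# Expected behavior: resume only the powered-down nodes in the allocation.
expected_resume = allocated & powered_down           # {'node-1', 'node-5'}

# Buggy behavior: resume the whole range spanned by the allocation,
# which also wakes or replaces node-2, node-3, node-4.
buggy_resume = set(expand_range(1, 5))               # {'node-1', ..., 'node-5'}

mistakenly_resumed = buggy_resume - expected_resume  # {'node-2', 'node-3', 'node-4'}
print(sorted(mistakenly_resumed))
```

In this example node-[2-4] end up in the resume set even though they were never allocated to the job, which is how already-running nodes can be replaced.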

To detect occurrences of the issue, in particular when it is triggered by scenarios 2 and 3, check for the Terminating orphaned instances message in the /var/log/parallelcluster/clustermgtd log file. Note that the presence of such a log entry does not by itself prove this specific issue. Here is an example (a detection sketch follows the excerpt):

2022-12-21 14:17:29,171 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance
2022-12-21 14:17:29,171 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Terminating orphaned instances
2022-12-21 14:17:29,179 - [slurm_plugin.instance_manager:delete_instances] - INFO - Terminating instances (x2) ['i-1234567890abcdef1', 'i-1234567890abcdef2']
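
A minimal detection sketch, assuming the default log location and the message shown above; the script structure and function name are illustrative, not part of ParallelCluster:

```python
#!/usr/bin/env python3
"""Illustrative sketch: list clustermgtd log entries that may indicate this issue."""
from pathlib import Path

LOG_FILE = Path("/var/log/parallelcluster/clustermgtd")
MARKER = "Terminating orphaned instances"

def find_orphan_terminations(log_path: Path) -> list[str]:
    """Return log lines mentioning orphaned-instance termination."""
    if not log_path.exists():
        return []
    with log_path.open(errors="replace") as log:
        return [line.rstrip() for line in log if MARKER in line]

if __name__ == "__main__":
    hits = find_orphan_terminations(LOG_FILE)
    if hits:
        print(f"Found {len(hits)} potential occurrence(s):")
        for line in hits:
            print(line)
    else:
        print("No 'Terminating orphaned instances' entries found.")
```

Any matches found by this scan are only a hint; correlate their timestamps with re-queued jobs (for example via sacct) before attributing them to this bug.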

Affected versions

ParallelCluster versions from 3.0.0 to 3.2.1 are impacted because they ship Slurm versions prior to 22.05, which contain the bug.
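
As a quick check, here is a sketch that tests whether a given ParallelCluster version string falls in the affected range; the packaging dependency and the helper name are assumptions, not part of ParallelCluster:

```python
from packaging.version import Version

AFFECTED_MIN = Version("3.0.0")
AFFECTED_MAX = Version("3.2.1")

def is_affected(pcluster_version: str) -> bool:
    """Return True if the ParallelCluster version is in the affected range."""
    version = Version(pcluster_version)
    return AFFECTED_MIN <= version <= AFFECTED_MAX

# Example: check the version reported by `pcluster version`.
for candidate in ("3.1.4", "3.2.1", "3.3.0"):
    print(candidate, "affected" if is_affected(candidate) else "not affected")
```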

Mitigation

The suggested mitigation is to upgrade to the latest supported ParallelCluster version that distributes Slurm 22.05.
