Configuring all_or_nothing_batch launches
Only applicable for pcluster >= v2.10.0 && scheduler == slurm.
Starting with AWS ParallelCluster version 2.10, you have additional control over instance launch behavior. By default (and in all versions <= 2.9), instances in a batch follow a soft launch (best-effort) strategy: the batch launches partially even if some of the requested instances can't be provisioned. With version 2.10, you can instead opt into a firm, all-or-nothing launch behavior, in which the entire batch fails if any one of the instances in the batch can't be provisioned.
To enable the all-or-nothing (firm) launch behavior for dynamic nodes, set all_or_nothing_batch = True in the ResumeProgram config at /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf.
To further tune instance launch behavior for the specific requirements of your application, you can set the max_batch_size parameter. This parameter specifies the upper limit on the number of instances in a launch-batch. The next section provides a more in-depth explanation of this parameter as well as several related considerations.
Note: The default settings are all_or_nothing_batch = False and max_batch_size = 500. With these defaults, instances in a batch follow a soft launch, and the large batch size keeps the number of RunInstances calls small.
The following example is a ResumeProgram config with the firm (all-or-nothing) launch behavior enabled:
$ cat /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf
# DO NOT MODIFY
[slurm_resume]
cluster_name = demo
region = us-east-2
proxy = NONE
dynamodb_table = parallelcluster-demo
hosted_zone = ZXXXXXXXXXXXXXXX
dns_domain = demo.pcluster
use_private_hostname = false
master_private_ip = 172.x.x.x
master_hostname = ip-172-x-x-x.us-east-2.compute.internal
# Options added:
all_or_nothing_batch = True
max_batch_size = 100
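If you prefer to script the change rather than edit the file by hand, appending the options also works (a minimal sketch run on the head node; the values shown are the ones from the example above):
# Append the options to the ResumeProgram config (requires root)
$ echo "all_or_nothing_batch = True" | sudo tee -a /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf
$ echo "max_batch_size = 100" | sudo tee -a /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf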
With the improved Slurm integration added to AWS ParallelCluster in version 2.9, dynamic nodes use the Slurm cloud bursting plugin and rely on Slurm's integrated power saving mechanism (for more details, you can consult the Slurm user guide).
The workflow to launch instances for dynamic nodes is as follows:
- A job is submitted to Slurm.
- Slurm allocates the job to multiple nodes, putting any node that is in the POWER_SAVED state into the POWER_UP state.
- Slurm calls ResumeProgram at a specified time interval to request nodes that need to be launched as part of the POWER_UP workflow. No job information is passed into ResumeProgram. Slurm can call ResumeProgram with nodes from multiple jobs, but it doesn't pass node-job correspondence information into ResumeProgram.
- Slurm calls ResumeProgram with a list of nodes to launch. The nodes are parsed from the input in the same order in which they are passed.
- For each of the instance types in a queue or partition, AWS ParallelCluster groups the nodes into batches based on the max_batch_size configuration and uses the RunInstances call to launch these batches. Each call is referred to as a launch-batch.
- The RunInstances call launches the instances that back the nodes. Any node whose instance can't be launched is placed in the POWER_DOWN state and later reset to the POWER_SAVED state.
- When the launched instances finish the configuration process, slurmd is started on each compute instance. At this point, the job can begin to use the corresponding Slurm nodes.
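If you want to observe this workflow on a running cluster, standard Slurm commands show the power-saving state transitions described above (a quick sketch using a node name from the examples below; the sinfo state suffixes follow Slurm's conventions: ~ for power-saved nodes, # for nodes powering up, % for nodes powering down):
# List node states; the suffix on STATE reflects the power-saving status
$ sinfo -N -o "%N %T"
# Inspect a single dynamic node in detail
$ scontrol show node queue1-dy-c5xlarge-1 | grep -i state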
The all_or_nothing_batch parameter modifies how RunInstances calls behave. By default, RunInstances calls follow a soft launch strategy, in which some, but not necessarily all, instances in a launch-batch are launched by the call. If all_or_nothing_batch = True is set, each RunInstances call follows a firm launch strategy: all instances in a launch-batch must be launched. If capacity for the entire launch-batch can't be obtained, the RunInstances call fails and no instance is launched.
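Conceptually, this distinction maps onto the MinCount/MaxCount semantics of the EC2 RunInstances API: a best-effort request sets a minimum of 1 instance, while an all-or-nothing request sets the minimum equal to the maximum, so EC2 either launches the full count or nothing. ParallelCluster issues these calls for you; the following AWS CLI sketch only illustrates the API semantics for a launch-batch of 50 (the AMI, instance type, and subnet IDs are placeholders):
# Best-effort (soft) launch: EC2 launches anywhere from 1 to 50 instances
$ aws ec2 run-instances --image-id ami-xxxxxxxx --instance-type c5.xlarge \
    --subnet-id subnet-xxxxxxxx --count 1:50
# All-or-nothing (firm) launch: EC2 launches exactly 50 instances or none at all
$ aws ec2 run-instances --image-id ami-xxxxxxxx --instance-type c5.xlarge \
    --subnet-id subnet-xxxxxxxx --count 50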
- Let's assume max_batch_size = 4 and Slurm calls ResumeProgram with the following list of nodes: queue1-dy-c5xlarge-[1-5], queue1-dy-t2micro-2, queue2-dy-c5xlarge-[1-3]. This results in the following:
  - For every instance type in every queue and partition, nodes are grouped into launch-batches according to the max_batch_size = 4 configuration. This produces the following batches: queue1-dy-c5xlarge-[1-4], queue1-dy-c5xlarge-5, queue1-dy-t2micro-2, queue2-dy-c5xlarge-[1-3]. One RunInstances call is made for every launch-batch.
  - If all_or_nothing_batch = False is set, and the RunInstances call for queue2-dy-c5xlarge-[1-3] can't launch at full capacity and can only launch one c5.xlarge instance, then queue2-dy-c5xlarge-1 is configured with the successfully launched instance, and queue2-dy-c5xlarge-[2-3] are marked down because their instances failed to launch.
  - If all_or_nothing_batch = True is set, and the RunInstances call for queue2-dy-c5xlarge-[1-3] can't launch at full capacity, the entire RunInstances call fails and no instance is launched. queue2-dy-c5xlarge-[1-3] are marked down because their instances failed to launch.
This feature has the following limitations:
- Requests are executed on a per-Slurm-request basis rather than a per-job basis. Slurm groups requests for multiple jobs that are submitted in close succession. This can sometimes result in a failed instance launch for a group of jobs even when the available capacity could satisfy some (though not all) of the jobs.
  - Example: Three jobs are submitted in close succession. Job 1 is allocated to queue1-dy-c5xlarge-[1-4], Job 2 is allocated to queue1-dy-c5xlarge-[5-6], and Job 3 is allocated to queue1-dy-c5xlarge-[7-8], with max_batch_size = 10 and all_or_nothing_batch = True set. Slurm calls ResumeProgram to POWER_UP queue1-dy-c5xlarge-[1-8]. Because max_batch_size = 10 is set, queue1-dy-c5xlarge-[1-8] is grouped into the same launch-batch. If the entire capacity can't be launched, the launch of queue1-dy-c5xlarge-[1-8] fails, causing all three jobs to fail.
  - This limitation can be mitigated by spacing out job submissions so that each call to ResumeProgram only contains the nodes required for one job. We recommend waiting at least one minute between submitting two jobs so that each call to ResumeProgram contains nodes from only one job; see the sketch after this list.
- Firm all-or-nothing scaling is achieved on a per-instance-type basis. You can't use multiple instance types for all-or-nothing scaling. If you want to run a job that requires multiple instance types (for example, the P3dn and C5n instance types), an all-or-nothing scaling guarantee isn't possible.
  - Example: Job 1 is allocated to queue1-dy-c5xlarge-[1-4] and queue1-dy-t2micro-1, with max_batch_size = 10 and all_or_nothing_batch = True set. The launch-batches are queue1-dy-c5xlarge-[1-4] and queue1-dy-t2micro-1, and they are launched by two separate RunInstances calls.
- All-or-nothing isn't guaranteed if the node set allocated to a job isn't contiguous.
  - Slurm most often calls ResumeProgram with an ordered node list, and ResumeProgram groups nodes into launch-batches based on the order in which they are passed in. As a result, launch-batches aren't necessarily correlated with the nodes used by each job.
  - Example: Two jobs are submitted in close succession. Job 1 is allocated to queue1-dy-c5xlarge-[1, 3-4], Job 2 is allocated to queue1-dy-c5xlarge-[2, 5-6], and max_batch_size = 3 is set. Slurm calls ResumeProgram with queue1-dy-c5xlarge-[1-6], and the launch-batches are queue1-dy-c5xlarge-[1-3] and queue1-dy-c5xlarge-[4-6], which don't correspond directly to the two jobs.
  - Using --contiguous when submitting your jobs might help.
- The maximum number of nodes that can be launched in a firm launch is limited by the maximum number of instances that can be launched with a single RunInstances call. This number is restricted by the limits of the RunInstances API. By default, the API allows up to 500 instances to be launched in a single RunInstances call (max_batch_size = 500). If you want to increase this limit, contact AWS customer support.
- This functionality is only available as a per-cluster option; all queues are subject to the same all_or_nothing_batch and max_batch_size settings.
- If an instance fails during boot (for example, an EC2 instance or node failure), the specific node that fails is automatically terminated, and Slurm reassigns the job to a new node or a new set of nodes. In most cases, Slurm reuses the previously successful nodes and requests a replacement for the failed instance or instances. When this occurs, all-or-nothing scaling cannot be guaranteed because the nodes running the job come from different launch-batches.
- This feature is only available for dynamic nodes.
- The ResumeProgram workflow only applies to dynamic nodes. Static nodes are managed separately by clustermgtd on a soft launch, best-effort basis.
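As noted in the first limitation above, one simple mitigation is to space out job submissions so that each ResumeProgram call contains nodes from a single job. The following is a minimal sketch of this approach (the job script names are hypothetical):
# Submit jobs one at a time, waiting about one minute between submissions so
# that each ResumeProgram call only contains nodes from one job
$ for job in job1.sbatch job2.sbatch job3.sbatch; do sbatch "$job"; sleep 60; done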
With the all_or_nothing_batch option, you can choose whether launch-batches use a soft launch or a firm launch. Note that all-or-nothing scaling can't be guaranteed on a per-job basis because launch-batches don't directly correspond to all the nodes required by a job.
The following example illustrates how all-or-nothing batches function. Homogeneous jobs are submitted in a way that matches each launch-batch with all the nodes required by a job. In practice, most job submissions are heterogeneous in terms of instance type and the number of nodes required.
We recommend that, when you enable all-or-nothing batches, you set the batch size to the largest number of single-instance-type instances required for a job, so that your largest launches can be completed with all-or-nothing launch behavior.
Consider the following conditions when setting your batch size.
- Large max_batch_size: Large launches can follow an all-or-nothing strategy. However, smaller launches for small jobs might be grouped together and might collectively fail if there isn't enough capacity for the entire group.
- Small max_batch_size: Small launches can be separated into independent launch-batches, but large launches might be split across multiple launch-batches, so an all-or-nothing launch strategy can't be applied to large launches.
Example Scenario
- Consider the following scenario, which shows the initial state of a cluster. There are no jobs in the queues. The scheduled2 queue contains 5,000 nodes, all in the POWER_SAVING state.
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
scheduled1* up infinite 20 idle~ scheduled1-dy-m54xlarge-[1-20]
scheduled2 up infinite 5000 idle~ scheduled2-dy-c5xlarge-[1-5000]
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
- Submit jobs that each require 50 nodes, and use all-or-nothing behavior for launch-batches.
- Add the option all_or_nothing_batch = True to the ResumeProgram config located at /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf. This turns on the all-or-nothing launch behavior for each launch-batch.
- The default setting is max_batch_size = 500. If you keep this default, then because the queue only contains one instance type, all the nodes in the queue are most likely grouped into one launch-batch. If that entire batch can't be launched, multiple small jobs might fail as a result.
- Because the largest number of single-instance-type instances required for a job is 50 in this example, each launch-batch should correspond to exactly one job. To achieve this, set max_batch_size = 50 in the ResumeProgram config located at /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf. Note that the ResumeProgram config already has other parameters in place. Don't modify those parameters.
$ cat /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf
# DO NOT MODIFY
[slurm_resume]
cluster_name = scheduled
region = us-east-2
proxy = NONE
dynamodb_table = parallelcluster-scheduled
hosted_zone = ZXXXXXXXXXXXXXXX
dns_domain = scheduled.pcluster
use_private_hostname = false
master_private_ip = 172.x.x.x
master_hostname = ip-172-x-x-x.us-east-2.compute.internal
# Option added:
all_or_nothing_batch = True
max_batch_size = 50
- Submit 20 "sleep 300" array jobs, each requiring 50 nodes (1000 nodes in total). Failed jobs are in most cases requeued, but to clearly show which jobs failed because of an unsuccessful launch, specify the --no-requeue option so that failed jobs are removed from the queue.
# -a 1-20: submit 20 array jobs
# -N 50: each job requires 50 nodes
# -p scheduled2: submit job to scheduled2 queue/partition
# --exclusive: allocate each node to a job completely, so no 2 jobs will share a node
# --no-requeue: no automatic requeue in case of failure
# --contiguous: allocated nodes must form a contiguous set
$ sbatch --wrap "sleep 300" -a 1-20 -N 50 -p scheduled2 --exclusive --no-requeue --contiguous
- After the jobs are submitted, a total of 1000 nodes are assigned to jobs by Slurm. Because the nodes were initially in the POWER_SAVING state, Slurm places the nodes in the POWER_UP state and calls ResumeProgram to launch instances for the nodes.
# 1000 nodes placed into `POWER_UP` state
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
scheduled1* up infinite 20 idle~ scheduled1-dy-m54xlarge-[1-20]
scheduled2 up infinite 1000 alloc# scheduled2-dy-c5xlarge-[1-1000]
scheduled2 up infinite 4000 idle~ scheduled2-dy-c5xlarge-[1001-5000]
- 1000 nodes are launched in batches of up to 50 by ResumeProgram. With all_or_nothing_batch = True set, each launch-batch of 50 nodes either launches all 50 instances (when the full capacity can be obtained) or fails entirely and launches no instances.
- Because of capacity limitations, only 300 instances (6 batches of 50) can be launched, covering 300 nodes. The remaining 700 nodes (14 failed launch-batches) are placed into the POWER_DOWN state.
# Only able to launch 300 nodes
# Failed 700 nodes placed into down automatically by `ResumeProgram`
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
scheduled1* up infinite 20 idle~ scheduled1-dy-m54xlarge-[1-20]
scheduled2 up infinite 300 alloc# scheduled2-dy-c5xlarge-[1-300]
scheduled2 up infinite 700 down# scheduled2-dy-c5xlarge-[301-1000]
scheduled2 up infinite 4000 idle~ scheduled2-dy-c5xlarge-[1001-5000]
- Because of how the jobs were set up and because of the max_batch_size setting, the node sets allocated to the jobs line up with the launch-batches. Six jobs run successfully because there are six successful launch-batches. The other jobs failed because their launch-batches also failed.
# 6 jobs are able to run; the rest failed and are removed from the queue because of --no-requeue
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
872_1 scheduled wrap ubuntu CF 1:25 50 scheduled2-dy-c5xlarge-[1-50]
872_2 scheduled wrap ubuntu CF 1:25 50 scheduled2-dy-c5xlarge-[51-100]
872_3 scheduled wrap ubuntu CF 1:25 50 scheduled2-dy-c5xlarge-[101-150]
872_4 scheduled wrap ubuntu CF 1:25 50 scheduled2-dy-c5xlarge-[151-200]
872_5 scheduled wrap ubuntu CF 1:25 50 scheduled2-dy-c5xlarge-[201-250]
872_6 scheduled wrap ubuntu CF 1:25 50 scheduled2-dy-c5xlarge-[251-300]
# After a bit of time, jobs are up and running; the 700 nodes that failed to launch are in POWER_DOWN
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
872_5 scheduled wrap ubuntu R 0:01 50 scheduled2-dy-c5xlarge-[201-250]
872_1 scheduled wrap ubuntu R 1:01 50 scheduled2-dy-c5xlarge-[1-50]
872_2 scheduled wrap ubuntu R 1:01 50 scheduled2-dy-c5xlarge-[51-100]
872_3 scheduled wrap ubuntu R 1:01 50 scheduled2-dy-c5xlarge-[101-150]
872_4 scheduled wrap ubuntu R 1:01 50 scheduled2-dy-c5xlarge-[151-200]
872_6 scheduled wrap ubuntu R 1:01 50 scheduled2-dy-c5xlarge-[251-300]
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
scheduled1* up infinite 20 idle~ scheduled1-dy-m54xlarge-[1-20]
scheduled2 up infinite 700 idle% scheduled2-dy-c5xlarge-[301-1000]
scheduled2 up infinite 4000 idle~ scheduled2-dy-c5xlarge-[1001-5000]
scheduled2 up infinite 300 alloc scheduled2-dy-c5xlarge-[1-300]
You can enable the all-or-nothing scaling behavior at the launch-batch level by setting all_or_nothing_batch = True in the ResumeProgram config at /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf.
If you set all_or_nothing_batch = True, we recommend setting max_batch_size to the largest number of single-instance-type instances that can be used for a job, so that your largest launches are made in an all-or-nothing fashion.
You should consider the following conditions when setting max_batch_size:
- Large max_batch_size: Large launches can be made in an all-or-nothing fashion. However, smaller launches for small jobs might be grouped together and might fail collectively if capacity for the entire group can't be satisfied.
- Small max_batch_size: Small launches can be separated into smaller, independent launch-batches, but large launches might be split across multiple launch-batches, and all-or-nothing behavior will not apply to large launches.
ResumeProgram logs are located at /var/log/parallelcluster/slurm_resume.log and can be helpful for debugging and understanding launch behavior. As a general rule, control over the all_or_nothing_batch and max_batch_size options in the ResumeProgram workflow gives you greater control over how dynamic nodes are launched.
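For example, you can follow the log while submitting jobs to see how nodes are grouped into launch-batches and whether launches succeed (the exact log messages vary by ParallelCluster version, and the grep pattern below is only illustrative):
# Follow ResumeProgram activity while jobs are being submitted
$ sudo tail -f /var/log/parallelcluster/slurm_resume.log
# Search past activity for launch-related messages
$ sudo grep -i "launch" /var/log/parallelcluster/slurm_resume.log | tail -n 20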