PCluster 3.10.1 and 3.11.0 Slurm compute daemon node configuration differs from hardware #6449

stefan-maxar · 2024-10-03T15:30:38Z

Hello,

We have been testing to upgrade from PCluster 3.8.0 to 3.11.0 and noticed some differences that impact performance after extensive testing of our applications. We run hybrid MPI-openMP applications using HPC6a.48xlarge instances and noticed that after testing PCluster 3.10.1 or 3.11.0 all of our applications are running ~40% slower than 3.8.0 using the out-of-the-box PCluster AMIs associated with either version. We narrowed down the issue by downgrading/changing versions of performance impacting software (such as EFA installer, downgrading to v1.32.0 or v1.33.0), switching how the job is submitted/run in Slurm (Hydra bootstrap and mpiexec vs PMIv2 and srun), and some other changes that did not improve the degraded performance.

Upon investigation, we noticed that the slurmd compute daemon on the HPC6a.48xlarge instances incorrectly identifies the hardware configuration, resulting in improper job placement and degraded performance. Snapshots of the slurmd from varying versions of PCluster as follows:

HPC6a.48xlarge on PCluster 3.8.0 with Slurm 23.02.7 (correct when considering NUMA node as socket):

[2024-10-03T09:14:54.114] Considering each NUMA node as a socket
[2024-10-03T09:14:54.114] Node reconfigured socket/core boundaries SocketsPerBoard=96:4(hw) CoresPerSocket=1:24(hw)
[2024-10-03T09:14:54.116] Considering each NUMA node as a socket
[2024-10-03T09:14:54.124] CPU frequency setting not configured for this node
[2024-10-03T09:14:54.130] slurmd version 23.02.7 started
[2024-10-03T09:14:54.168] slurmd started on Thu, 03 Oct 2024 09:14:54 +0000
[2024-10-03T09:14:54.169] CPUs=96 Boards=1 Sockets=4 Cores=24 Threads=1 Memory=378805 TmpDisk=40947 Uptime=240 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

HPC6a.48xlarge on PCluster 3.10.1 with Slurm 23.11.7:

[2024-10-01T13:38:57.884] Considering each NUMA node as a socket
[2024-10-01T13:38:57.960] Considering each NUMA node as a socket
[2024-10-01T13:38:57.965] CPU frequency setting not configured for this node
[2024-10-01T13:38:58.142] slurmd version 23.11.7 started
[2024-10-01T13:38:58.221] slurmd started on Tue, 01 Oct 2024 13:38:58 +0000
[2024-10-01T13:38:58.221] CPUs=96 Boards=1 Sockets=96 Cores=1 Threads=1 Memory=378805 TmpDisk=40947 Uptime=123 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

HPC6a.48xlarge on PCluster 3.11.0 with Slurm 23.11.10:

[2024-10-03T13:56:38.733] Considering each NUMA node as a socket
[2024-10-03T13:56:38.735] Considering each NUMA node as a socket
[2024-10-03T13:56:38.740] CPU frequency setting not configured for this node
[2024-10-03T13:56:39.387] pyxis: version v0.20.0
[2024-10-03T13:56:39.388] slurmd version 23.11.10 started
[2024-10-03T13:56:39.830] slurmd started on Thu, 03 Oct 2024 13:56:39 +0000
[2024-10-03T13:56:39.831] CPUs=96 Boards=1 Sockets=96 Cores=1 Threads=1 Memory=378805 TmpDisk=40947 Uptime=377 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

lscpu from a HPC6a.48xlarge instance:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              96
On-line CPU(s) list: 0-95
Thread(s) per core:  1
Core(s) per socket:  48
Socket(s):           2
NUMA node(s):        4
Vendor ID:           AuthenticAMD
CPU family:          25
Model:               1
Model name:          AMD EPYC 7R13 Processor
Stepping:            1
CPU MHz:             2420.130
BogoMIPS:            5299.99
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            16384K
NUMA node0 CPU(s):   0-23
NUMA node1 CPU(s):   24-47
NUMA node2 CPU(s):   48-71
NUMA node3 CPU(s):   72-95

Is there some fix (or workaround) to properly reconfigure the node configuration in PCluster 3.11.0? It looks like some process/script that was run in 3.8.0 (e.g. line: [2024-10-03T09:14:54.114] Node reconfigured socket/core boundaries ...) is either not being run or not running properly. We'd prefer not to hard code the proper node configuration in the PCluster compute resource YAML as we dynamically spin up/down clusters and could use difference instance types in a given compute resource depending on resource availability.

Thanks for any help you can provide!

The text was updated successfully, but these errors were encountered:

hanwen-pcluste · 2024-10-14T19:56:17Z

Hi Stefan!

Thank you for the detailed description. I could reproduce the same issue. The same logs appears in the slurmd.log on compute nodes.

I am actively working on this and will keep you updated!

Thank you,
Hanwen

demartinofra · 2024-10-15T18:20:45Z

Hi Stefan,

ParallelCluster has never explicitly configured Sockets and Cores for Slurm nodes, therefore Slurm uses its defaults. This could be due to Slurm 23.11 changing the way the value for Sockets and Cores are computed. Were you able to confirm that after setting the expected values for Sockets and Cores in slurm.conf the performance degradation is resolved? I don't expect seeing relevant changes in scheduling behaviour due to the lack of Sockets/Cores configuration that justify such a big regression.

Would you be able to extract some logs showing how processes are mapped to the various cores? also if you don't mind can you share the cluster configuration and a potential reproducer?

Also if you don't mind could you share the full Slurm config from both clusters? You can retrieve it with scontrol write config /tmp/slum.conf

If Sockets and Cores configuration turns out to be a red herring here is another potential issue to look into:
The Amazon Linux Kernel versions [v6.1.82, v5.15.152, v5.10.213] contain mitigations for CVE-2023-20569. The SRSO mitigations are enabled by default but may have a performance impact for very specific workloads. It is possible to disable these security mitigations to avoid a possible performance impact, however users should carefully consider the security implications.. To disable specify spec_rstack_overflow=off as a kernel boot parameter. For further details see https://docs.kernel.org/admin-guide/hw-vuln/srso.html

Francesco

stefan-maxar · 2024-10-15T19:58:32Z

Hey @demartinofra - thanks for the reply!

For my testing, I did set the following in the PCluster configuration to force the proper configuration:

      ComputeResources:
          CustomSlurmSettings:
            Sockets: 4
            CoresPerSocket: 24

Which did yield proper configuration via slurmd (HPC6a.48xlarge on PCluster 3.11.0 with Slurm 23.11.10):

[2024-10-15T19:13:36.503] Considering each NUMA node as a socket
[2024-10-15T19:13:36.505] Considering each NUMA node as a socket
[2024-10-15T19:13:36.509] CPU frequency setting not configured for this node
[2024-10-15T19:13:36.569] pyxis: version v0.20.0
[2024-10-15T19:13:36.570] slurmd version 23.11.10 started
[2024-10-15T19:13:36.619] slurmd started on Tue, 15 Oct 2024 19:13:36 +0000
[2024-10-15T19:13:36.619] CPUs=96 Boards=1 Sockets=4 Cores=24 Threads=1 Memory=378805 TmpDisk=40947 Uptime=205 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

For our applications, I did note some improvement in performance and recouped a few percent of the ~40% degradation using the proper hardware configuration. So, not quite red herring, but definitely not the solution either!

Regarding the SRSO mitigation - thanks for passing this along. This is news to me and is definitely something I am going to investigate further. From what I can see, HPC6a with PCluster 3.11 base AMI has that patch as you refer to:

/sys/devices/system/cpu/vulnerabilities/spec_rstack_overflow:Mitigation: safe RET, no microcode

Other than creating a custom AMI that disables this patch, do you have any suggestions for how to disable this upon instance startup via PCluster? The patch seemingly cant be removed during post install procedures because it requires instance reboot and once you reboot, slurm will detect the instance as "down" and will swap it out. I would rather not have to create a custom AMI if there is some other way to test this out. Thanks!

demartinofra · 2024-10-16T07:35:53Z

Other than creating a custom AMI that disables this patch, do you have any suggestions for how to disable this upon instance startup via PCluster?

If you want to test it real quick one option is to run the following on the compute nodes:

sudo grubby --update-kernel=ALL --args='spec_rstack_overflow=off'
sudo sync

and then reboot them through the scheduler, so that Slurm does not mark the nodes as unhealthy and the reboot is successful:

sudo -i scontrol reboot <nodelist>

stefan-maxar · 2024-10-16T20:03:02Z

Hi @demartinofra - I ran the commands you suggested to disable SRSO mitigation and rebooted via slurm which resulted in the patching being disabled:

/sys/devices/system/cpu/vulnerabilities/spec_rstack_overflow:Vulnerable, no microcode

I then ran one of our smaller-scale hybrid MPI-openMP jobs and the performance was expected with no ~40% performance degradation (I also corrected the HPC6a configuration, which also did help with performance a little). So, it definitely seems like this SRSO mitigation is the culprit for our application slowdowns...and I'll doubly confirm with our larger-scale job.

What do you suggest as a more formal workaround for the SRSO mitigation in the PCluster realm? Custom AMI? Something else? When we had performance issues because of the log4j patch, it was a simple yum remove that we could run during post install. This is a bit more involved and since we spin up and down clusters daily from the base PCluster AMI, it would be great if you could provide some recommendations. Thanks for bringing this to our attention again!

hanwen-pcluste · 2024-10-16T20:58:21Z

Hi Stefan,

We will work on a Wiki page to describe the mitigation in pcluster realm and let you know when it is done.

Thank you Stefan and Francesco for discovering the issue!
Hanwen

hanwen-pcluste · 2024-10-17T14:43:04Z

Also, please avoid using 3.11.0 because of the known issue https://github.com/aws/aws-parallelcluster/wiki/(3.11.0)-Job-submission-failure-caused-by-race-condition-in-Pyxis-configuration

hanwen-pcluste · 2024-10-23T19:25:36Z

Hi Stefan,

We've published Wiki page (3.9.1 ‐ latest) Speculative Return Stack Overflow (SRSO) mitigations introducing potential performance impact on some AMD processors

Moreover, we've released ParallelCluster 3.11.1

Cheers,
Hanwen

stefan-maxar added the 3.x label Oct 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PCluster 3.10.1 and 3.11.0 Slurm compute daemon node configuration differs from hardware #6449

PCluster 3.10.1 and 3.11.0 Slurm compute daemon node configuration differs from hardware #6449

stefan-maxar commented Oct 3, 2024

hanwen-pcluste commented Oct 14, 2024

demartinofra commented Oct 15, 2024 •

edited

Loading

stefan-maxar commented Oct 15, 2024

demartinofra commented Oct 16, 2024

stefan-maxar commented Oct 16, 2024

hanwen-pcluste commented Oct 16, 2024

hanwen-pcluste commented Oct 17, 2024

hanwen-pcluste commented Oct 23, 2024 •

edited

Loading

PCluster 3.10.1 and 3.11.0 Slurm compute daemon node configuration differs from hardware #6449

PCluster 3.10.1 and 3.11.0 Slurm compute daemon node configuration differs from hardware #6449

Comments

stefan-maxar commented Oct 3, 2024

hanwen-pcluste commented Oct 14, 2024

demartinofra commented Oct 15, 2024 • edited Loading

stefan-maxar commented Oct 15, 2024

demartinofra commented Oct 16, 2024

stefan-maxar commented Oct 16, 2024

hanwen-pcluste commented Oct 16, 2024

hanwen-pcluste commented Oct 17, 2024

hanwen-pcluste commented Oct 23, 2024 • edited Loading

demartinofra commented Oct 15, 2024 •

edited

Loading

hanwen-pcluste commented Oct 23, 2024 •

edited

Loading