
(3.9.1 ‐ latest) Speculative Return Stack Overflow (SRSO) mitigations introducing potential performance impact on some AMD processors

Hanwen edited this page Oct 24, 2024 · 2 revisions

Issue description

AWS ParallelCluster 3.9.1 and newer (except on CentOS 7) include Linux kernel versions which contain mitigations for CVE-2023-20569. The Speculative Return Stack Overflow (SRSO) mitigations are enabled by default but may have a performance impact for intra-node MPI communication on machines with AMD processors. It is possible to disable these security mitigations to avoid the performance impact, but users should carefully consider the security implications before doing so.

You can quickly check if your AMI is affected by reading the value of the following sysfs file:

cat /sys/devices/system/cpu/vulnerabilities/spec_rstack_overflow

If the output of the command above shows a mitigation in place (e.g. Mitigation: Safe RET), your compute nodes may be affected by the performance degradation (please refer to the Linux kernel documentation for more information on the meaning of the value returned by the command).
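The check above can also be scripted, for example to run it on every compute node. The following is a minimal sketch: the function name `srso_status` and its three-way classification are hypothetical conveniences, based on the status prefixes listed in the Linux kernel SRSO documentation (`Not affected`, `Vulnerable...`, `Mitigation: ...`).

```shell
#!/bin/sh
# Hypothetical helper (not part of ParallelCluster): coarse classification of
# the SRSO status string exposed by the kernel. The optional argument exists
# only so that the function can be exercised against a sample file.
srso_status() {
  file="${1:-/sys/devices/system/cpu/vulnerabilities/spec_rstack_overflow}"
  if [ ! -r "$file" ]; then
    # The kernel does not report SRSO status (file missing or unreadable).
    echo "unknown"
    return 0
  fi
  value="$(cat "$file")"
  case "$value" in
    "Not affected"*) echo "not-affected" ;;
    Mitigation:*)    echo "mitigated" ;;    # e.g. "Mitigation: Safe RET"
    Vulnerable*)     echo "vulnerable" ;;   # mitigation off or incomplete
    *)               echo "unknown" ;;
  esac
}

# Report the status of the current machine.
srso_status
```

A result of `mitigated` corresponds to the case described above, where compute nodes may be affected by the performance degradation. The classification is deliberately coarse; consult the kernel documentation for the exact meaning of each value.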

In particular, the OSU MPI bandwidth benchmark osu_bw can be used to detect the performance degradation, but the actual impact depends heavily on the specific workload being executed. When the vulnerability mitigation is in place, you may see a decrease of up to 40% in intra-node MPI bandwidth (via shared memory) between two processes on the same node at large message sizes (over 4 KiB).
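To quantify the degradation, you can compare the bandwidth that osu_bw reports for the same message size with and without the mitigation. The sketch below only does the arithmetic; the function name `bw_decrease_pct` and the sample numbers are hypothetical and are not measurements from this page.

```shell
#!/bin/sh
# Typical intra-node run (launcher and binary path depend on your MPI setup):
#   mpirun -np 2 osu_bw
#
# Hypothetical helper: percentage decrease between a baseline bandwidth and a
# measured bandwidth, both in MB/s as reported by osu_bw.
bw_decrease_pct() {
  awk -v base="$1" -v cur="$2" \
    'BEGIN { printf "%.1f\n", (base - cur) / base * 100 }'
}

# Illustrative values only: 10000 MB/s baseline vs 6000 MB/s measured.
bw_decrease_pct 10000 6000   # prints 40.0
```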

Affected versions (OSes, schedulers)

All ParallelCluster versions are affected on AMD instances where the Linux kernel is v6.1.82+, v5.15.152+ or v5.10.213+. As a result, all official ParallelCluster AMIs (except for CentOS 7) starting from v3.9.1 may suffer a performance impact on AMD instances. Custom AMIs whose Linux kernels include the security mitigations mentioned above are affected as well.
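The version thresholds above can be compared against the output of `uname -r`. The following is a rough sketch with hypothetical helper names (`version_ge`, `kernel_in_affected_range`); it assumes GNU `sort -V` is available and that the kernel release string follows upstream stable numbering, as Amazon Linux 2 kernels do. Distributions such as Ubuntu or RHEL backport fixes without changing the upstream version number, so on those systems the sysfs file shown earlier is the reliable check.

```shell
#!/bin/sh
# version_ge A B: succeeds when version A >= version B (relies on GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# kernel_in_affected_range RELEASE
#   returns 0 if RELEASE is at or above the threshold for its series,
#   1 if it is below, and 2 if the series is not one of those listed above.
kernel_in_affected_range() {
  base="${1%%-*}"   # drop the distro suffix, e.g. "5.10.213-201.amzn2" -> "5.10.213"
  case "$base" in
    6.1.*)  version_ge "$base" 6.1.82 ;;
    5.15.*) version_ge "$base" 5.15.152 ;;
    5.10.*) version_ge "$base" 5.10.213 ;;
    *)      return 2 ;;
  esac
}

release="$(uname -r)"
if kernel_in_affected_range "$release"; then
  echo "$release: at or above a listed threshold"
else
  echo "$release: below the threshold or series not listed; check the sysfs file"
fi
```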

Mitigation

For further details on the security risks of disabling the SRSO mitigations, see https://docs.kernel.org/admin-guide/hw-vuln/srso.html.

To avoid the performance impact, you need to build a new ParallelCluster AMI in which the SRSO mitigation is disabled.

  1. Follow steps 1-6 described in the official guide Modify an AWS ParallelCluster AMI
  2. Customize the AMI according to guidance from https://docs.kernel.org/admin-guide/hw-vuln/srso.html and from your Linux distribution. The following examples are provided for your convenience but may vary depending on your setup:

2.1. for Amazon Linux 2, RHEL and Rocky Linux:

sudo grubby --update-kernel=ALL --args='spec_rstack_overflow=off'

# Check that spec_rstack_overflow=off is appended to kernel args
sudo grubby --info ALL

sudo sync

2.2. for Ubuntu:

# open `/etc/default/grub` and find the variable that contains all other kernel
# parameters used in your AMI (it should be either `GRUB_CMDLINE_LINUX` or
# `GRUB_CMDLINE_LINUX_DEFAULT`) and add `spec_rstack_overflow=off` to the kernel
# parameters
sudo vim /etc/default/grub 
# regenerate the grub configuration
sudo grub-mkconfig -o /boot/grub/grub.cfg

  3. Reboot the instance and reconnect to it after the reboot.
  4. Proceed with steps 8-11 described in the official guide Modify an AWS ParallelCluster AMI
  5. Create a cluster using the generated AMI.
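
If you prefer to script the Ubuntu edit from step 2.2 instead of using an editor, the change can be sketched with `sed`. This is a minimal sketch under stated assumptions: the function name `add_kernel_arg` is hypothetical, GNU `sed` is assumed, and the variable is assumed to be defined on a single double-quoted line. On some images, files under `/etc/default/grub.d/` may override the variable, so review the regenerated configuration before rebooting.

```shell
#!/bin/sh
# Hypothetical helper: append a kernel argument to a GRUB variable in a file,
# unless the argument is already present on that variable's line.
#   add_kernel_arg FILE VARIABLE ARGUMENT
add_kernel_arg() {
  file="$1"; var="$2"; arg="$3"
  if grep -Eq "^${var}=\".*${arg}.*\"" "$file"; then
    return 0   # already present, nothing to do
  fi
  # Insert the argument just before the closing double quote.
  sed -i -E "s|^(${var}=\"[^\"]*)\"|\1 ${arg}\"|" "$file"
}

# Demonstration on a temporary file (sample content is illustrative only).
demo="$(mktemp)"
printf 'GRUB_CMDLINE_LINUX_DEFAULT="console=tty1 console=ttyS0"\n' > "$demo"
add_kernel_arg "$demo" GRUB_CMDLINE_LINUX_DEFAULT spec_rstack_overflow=off
cat "$demo"
rm -f "$demo"

# On the real instance, as root, the equivalent would be:
#   add_kernel_arg /etc/default/grub GRUB_CMDLINE_LINUX_DEFAULT spec_rstack_overflow=off
#   grub-mkconfig -o /boot/grub/grub.cfg
```

After the reboot in step 3, `cat /proc/cmdline` should list `spec_rstack_overflow=off`, and the sysfs file from the issue description should no longer report a `Mitigation:` value.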