Home

Jump to bottom

Hanwen edited this page Oct 23, 2024 · 148 revisions

Welcome to the AWS ParallelCluster Wiki

System Administration

Schedulers ⏱🗓

Development 👨‍💻

Best Practices 👩‍💻

Debugging 🕷

Logging 🖨

Ninja Hacks 🚀

Known Issues 3.x 🚨

(3.9.1 ‐ latest) Speculative Return Stack Overflow (SRSO) mitigations introducing potential performance impact on some AMD processors
(3.11.0) Job submission failure caused by race condition in Pyxis configuration
(3.9.0‐current) Cluster creation fails on Rocky 9.4
(3.9.0‐3.10.1) Cluster update intermittently fails because some compute nodes don’t execute update procedure
(3.8.0+) Newer Linux kernels are no longer compatible with EFA and closed Source Nvidia drivers in instances with GPU Direct RDMA support
(3.8.0 ‐ 3.9.3) ParallelCluster Build Image Failing during Installation of Minitar Ruby Gem Dependency
(3.10.0) Build image fails in China regions
(3.9.0‐3.9.1) Default ThreadsPerCore Slurm setting causes reduced CPU utilization
(3.8.0-3.9.1) SharedStorageType: Efs not working on arm instances
(3.3.0‐3.9.0) Potential data loss issue when removing storage with update‐cluster in AWS ParallelCluster 3.3.0‐3.9.0
(3.4.0-3.9.0) Updating a cluster to include an EFS fs with encryption in transit fails
(3.8.0-3.9.0) Slurmd Does not Start with EFS SharedStorageType on reboot
(3.9.0-latest) SSH bootstrap cannot launch processes on remote host when using Intel MPI with Slurm 23.11
(3.0.0-latest) Build image CloudFormation stacks fail to delete after images are successfully built
(3.0.0-3.8.0) Interactive job submission through srun can fail after increasing the number of compute nodes in the cluster
(3.0.0-3.7.2) Cluster update rollback can fail when modifying the list of instance types declared in the Compute Resources
(3.6.0‐3.6.1) Slurm NodeHostName and NodeAddr mismatch for MultiNIC instance when managed DNS is disabled and EC2 Hostnames are used
(3.6.0) NVIDIA GPU nodes fail to start with custom AMI built from DLAMI
(3.0.0-3.6.0) Ptrace_scope not disabled for Ubuntu compute nodes
(3.0.0-3.6.0) Compute Nodes Belonging To More Than One Partition Causes Compute Scaling To Overscale
(3.2.0-3.5.1) GPU nodes not coming back online after scontrol reboot
(3.0.0-3.5.1) ParallelCluster CLI raises exception “module 'flask.json' has no attribute 'JSONEncoder'”
(3.3.0-3.5.1) Cluster updates can break Slurm accounting functionality
(3.3.0-3.5.0) Update cluster to remove shared EBS volumes can potentially cause node launching failures
(3.0.0-3.5.0) DCV virtual session on Ubuntu 20.04 might show a black screen
(3.3.0-3.4.1) Custom AMI creation fails on Ubuntu 20.04 during MySQL packages installation
(3.3.0-3.4.0) Slurm cluster NodeName and NodeAddr mismatch after cluster scaling
(3.0.0-3.2.1) Running nodes might be mistakenly replaced when new jobs are scheduled
(3.0.0-3.2.1) ParallelCluster API cannot create new cluster
(3.1.x) Termination of idle dynamic compute nodes potentially broken after performing a cluster update
(3.0.0-3.1.4) ParallelCluster API Stack Upgrade Fails for ECR resources
(3.0.0-3.1.4) Unable to perform cluster update when using API or documented user policies
(3.0.0-3.1.3) Unable to create cluster or custom image when using API or CLI with documented user policies
(3.0.0-3.1.3) AWSBatch Multi node Parallel jobs fail if no EBS defined in cluster
(3.1.1-3.1.2) Profiles not loaded when connected through NICE DCV session
(3.0.0-3.1.3) build image creates invalid images when using aws-cdk.aws-imagebuilder==1.153
(3.0.0 and later) build image stack deletion failed after image successfully created
(3.1.1) Issue with clusters in isolated networks
(3.0.0) Cluster scaling fails after a head node reboot on Ubuntu 18.04 and Ubuntu 20.04
(3.0.0) Deleting API Infrastructure produces CFN Stacks failure

Known Issues (3.x and 2.x) 🚨

Fixed

(3.0.2 / 2.11.3 and earlier) Custom AMI creation fails for centos7 and ubuntu1804 Issue started on 12/8/2021, resolved on 1/20/2022

Known Issues 2.x 🚨

Deprecated 🦴