# (3.3.0-3.9.0) Potential data loss issue when removing storage with update-cluster in AWS ParallelCluster 3.3.0-3.9.0
Starting with ParallelCluster 3.3.0, users can add and remove shared storage from a cluster with a pcluster update-cluster operation. When unmounting a filesystem, ParallelCluster normally performs a lazy unmount and then cleans up the mount point by deleting the mountdir and all subfolders under it. We identified an issue in AWS ParallelCluster versions 3.3.0 to 3.9.0 that can lead to a race condition: because a lazy unmount returns immediately even while the filesystem is still busy, the recursive delete of the mountdir may run before the unmount completes and remove data on the still-mounted filesystem, resulting in unintended data loss if appropriate backup policies are not in place.
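In shell terms, the problematic sequence looks roughly like the following (a simplified sketch for illustration, not the actual cookbook code; `/shared` is a hypothetical mountdir):

```bash
# Simplified illustration of the unsafe pattern (not the actual cookbook code).
umount -l /shared   # lazy unmount returns immediately, even while the filesystem is busy
rm -rf /shared      # if the unmount has not completed yet, this recursive delete can
                    # traverse the still-mounted filesystem and destroy the shared data
```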
This issue impacts all ParallelCluster versions from 3.3.0 to 3.9.0, across all operating systems, schedulers, and shared storage types.
To mitigate this issue on your existing cluster, choose one of the options below based on your use case and the ParallelCluster version you are using. If you choose neither option, we recommend that you avoid unmounting your filesystems; if you must unmount, apply backup policies first to avoid unintended data loss.
On 2024-04-11, we published patch release v3.9.1, which prevents this issue by deleting the mountdir only when it is empty, never its subfolders, thereby preventing unintended loss of data. Follow the upgrade instructions to move your cluster to ParallelCluster 3.9.1.
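Conceptually, the 3.9.1 behavior replaces the recursive delete with an empty-only delete; a simplified sketch of the idea (again, not the actual cookbook code):

```bash
# Simplified illustration of the patched behavior (not the actual cookbook code).
umount -l /shared
# rmdir removes only empty directories, so if the filesystem is still mounted
# (and therefore non-empty), the mountdir and its data are left untouched.
rmdir /shared || echo "mountdir not empty; leaving it in place"
```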
If upgrading your clusters is not the right option, you can apply the patch to the head node using the following instructions:
## Usage Instructions

1. Download the script to a working directory on your head node using one of the following commands:

   ```bash
   curl https://us-east-1-aws-parallelcluster.s3.amazonaws.com/patches/avoid-recursive-delete-on-unmount/patch-recursive-delete.sh -o patch-recursive-delete.sh
   ```

   OR

   ```bash
   aws s3api get-object --bucket us-east-1-aws-parallelcluster --key patches/avoid-recursive-delete-on-unmount/patch-recursive-delete.sh patch-recursive-delete.sh
   ```

   Note: You will need S3 GetObject permissions on the EC2 instance you're running the s3api get-object command from.
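If the instance profile is missing that permission, one way to grant it is an inline policy on the instance role; a sketch where the role name and policy name are placeholders to replace with your own:

```bash
# MyClusterInstanceRole and the policy name are placeholders; use the role
# attached to the instance that will run the s3api get-object command.
aws iam put-role-policy \
  --role-name MyClusterInstanceRole \
  --policy-name AllowPatchScriptDownload \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::us-east-1-aws-parallelcluster/patches/*"
    }]
  }'
```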
2. Make the script executable:

   ```bash
   chmod +x patch-recursive-delete.sh
   ```
3. Choose your desired request type (https or s3) and run the corresponding command below. Run the script with sudo privileges so that it can modify the files in /etc/chef.

   With https:

   ```bash
   sudo ./patch-recursive-delete.sh https
   ```

   With s3 (GetObject permissions and AWS credentials are required):

   ```bash
   sudo ./patch-recursive-delete.sh s3
   ```
4. Following the successful execution of the script, you'll see a message stating `Cookbook successfully patched`.
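If you are scripting the patch across hosts, you can check for that message explicitly; a minimal sketch (the log path is arbitrary):

```bash
# Run the patch and fail loudly if the success message is absent.
if sudo ./patch-recursive-delete.sh https | tee /tmp/patch.log | grep -q "Cookbook successfully patched"; then
  echo "Patch applied on $(hostname)"
else
  echo "Patch may have failed on $(hostname); inspect /tmp/patch.log" >&2
  exit 1
fi
```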
Because ParallelCluster 3.9.0 allows updating shared storage without requiring a compute fleet stop, the in-place patch needs to be applied to both new nodes and running nodes:
## Patching new nodes
To patch new compute nodes, execute the patching script with an OnNodeStart custom action.
- Add either one of the configurations below to the cluster configuration, based on your choice:

  ```yaml
  # Using S3
  CustomActions:
    OnNodeStart:
      Script: s3://us-east-1-aws-parallelcluster/patches/avoid-recursive-delete-on-unmount/patch-recursive-delete.sh
      Args:
        - s3
  ```

  OR

  ```yaml
  # Using https
  CustomActions:
    OnNodeStart:
      Script: https://us-east-1-aws-parallelcluster.s3.us-east-1.amazonaws.com/patches/avoid-recursive-delete-on-unmount/patch-recursive-delete.sh
      Args:
        - https
  ```
- Set QueueUpdateStrategy to COMPUTE_FLEET_STOP in the cluster configuration to prevent the replacement of running compute nodes during the update. Revert to the strategy you were using before only once the forced update completes successfully. If the replacement of compute nodes is acceptable in your case, you can instead set the strategy to DRAIN or TERMINATE, so that running compute nodes are replaced by new ones with the patch applied.

  ```yaml
  Scheduling:
    Scheduler: slurm
    SlurmSettings:
      QueueUpdateStrategy: COMPUTE_FLEET_STOP
  ```
- Request a forced update by submitting the pcluster update-cluster command as follows:

  ```bash
  pcluster update-cluster \
    --region REGION \
    --cluster-name CLUSTER_NAME \
    --cluster-configuration CONFIG_PATH \
    --force-update True
  ```
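To confirm the forced update has completed before reverting QueueUpdateStrategy, you can poll the cluster status; a sketch using the pcluster CLI (assuming it supports JMESPath filtering via --query, with REGION and CLUSTER_NAME as above):

```bash
# Poll every 30 seconds until clusterStatus reports UPDATE_COMPLETE.
watch -n 30 'pcluster describe-cluster \
  --region REGION \
  --cluster-name CLUSTER_NAME \
  --query clusterStatus'
```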
## Patching running nodes
To patch running nodes, you must execute the patching script on the whole fleet, either using SSM or manually through the scheduler.
### With SSM (recommended approach)
This procedure requires the user to have the `arn:aws:iam::aws:policy/AmazonSSMFullAccess` policy and the cluster nodes to have the `arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore` policy.
1. Take note of the cluster name, as you will use it in the command below.

2. Create an S3 bucket in the same region where the cluster is deployed; it will be used to store the logs generated by SSM.

3. Execute the patching script on your running fleet using either one of the commands below:
   - Using HTTPS to download objects from S3:

     ```bash
     # Set variables with your values
     CLUSTER_NAME="dloss-0417-1"
     PATCHING_SCRIPT_HTTPS_URL="https://us-east-1-aws-parallelcluster.s3.amazonaws.com/patches/avoid-recursive-delete-on-unmount/patch-recursive-delete.sh"
     BUCKET="mgiacomo-workspace-eu-west-1"
     REGION="eu-west-1"

     aws ssm send-command \
       --document-name "AWS-RunShellScript" \
       --document-version "1" \
       --targets "[{\"Key\":\"tag:parallelcluster:cluster-name\",\"Values\":[\"$CLUSTER_NAME\"]}]" \
       --parameters "{\"workingDirectory\":[\"\"],\"executionTimeout\":[\"3600\"],\"commands\":[\"curl $PATCHING_SCRIPT_HTTPS_URL -o patch-recursive-delete.sh\",\"chmod +x patch-recursive-delete.sh\",\"sudo ./patch-recursive-delete.sh https\"]}" \
       --comment "pcluster-patch-recursive-delete" \
       --timeout-seconds 600 \
       --max-concurrency "50" \
       --max-errors "0" \
       --output-s3-bucket-name "$BUCKET" \
       --output-s3-key-prefix "ssm/run-command/pcluster-patch-recursive-delete" \
       --cloud-watch-output-config '{"CloudWatchOutputEnabled":true}' \
       --region $REGION
     ```
   - Using AWS CLI to download objects from S3:

     ```bash
     # Set variables with your values
     CLUSTER_NAME="dloss-0417-1"
     PATCHING_SCRIPT_BUCKET="us-east-1-aws-parallelcluster"
     PATCHING_SCRIPT_KEY="patches/avoid-recursive-delete-on-unmount/patch-recursive-delete.sh"
     BUCKET="mgiacomo-workspace-eu-west-1"
     REGION="eu-west-1"

     aws ssm send-command \
       --document-name "AWS-RunShellScript" \
       --document-version "1" \
       --targets "[{\"Key\":\"tag:parallelcluster:cluster-name\",\"Values\":[\"$CLUSTER_NAME\"]}]" \
       --parameters "{\"workingDirectory\":[\"\"],\"executionTimeout\":[\"3600\"],\"commands\":[\"aws s3api get-object --bucket $PATCHING_SCRIPT_BUCKET --key $PATCHING_SCRIPT_KEY patch-recursive-delete.sh\",\"chmod +x patch-recursive-delete.sh\",\"sudo ./patch-recursive-delete.sh s3\"]}" \
       --comment "pcluster-patch-recursive-delete" \
       --timeout-seconds 600 \
       --max-concurrency "50" \
       --max-errors "0" \
       --output-s3-bucket-name "$BUCKET" \
       --output-s3-key-prefix "ssm/run-command/pcluster-patch-recursive-delete" \
       --cloud-watch-output-config '{"CloudWatchOutputEnabled":true}' \
       --region $REGION
     ```
4. Monitor the execution in the SSM console.
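Alternatively, you can check the status from the CLI; a sketch where COMMAND_ID is the CommandId value returned by the send-command call above:

```bash
# Summarize per-instance status for the patch command.
aws ssm list-command-invocations \
  --command-id "$COMMAND_ID" \
  --region "$REGION" \
  --query "CommandInvocations[].{Instance:InstanceId,Status:Status}" \
  --output table
```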
### Without SSM
1. Patch the head node by executing the Usage Instructions above on it.

2. Patch the login nodes by executing the Usage Instructions above on each of them.

3. Patch the compute nodes by running the patching script as a Slurm job, as follows (see the sketch after this list for a way to submit one job per node):

   ```bash
   sbatch -w NODE_LIST --wrap "curl https://us-east-1-aws-parallelcluster.s3.amazonaws.com/patches/avoid-recursive-delete-on-unmount/patch-recursive-delete.sh -o patch-recursive-delete.sh ; chmod +x patch-recursive-delete.sh ; sudo ./patch-recursive-delete.sh https"
   ```
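Since a batch script executes on the first node of its allocation, one way to cover every compute node is to submit one job per node; a sketch that derives the node list from sinfo:

```bash
# Submit one patch job per compute node known to Slurm, so each node
# runs the patching script on itself.
for node in $(sinfo -h -N -o "%N" | sort -u); do
  sbatch -w "$node" --wrap "curl https://us-east-1-aws-parallelcluster.s3.amazonaws.com/patches/avoid-recursive-delete-on-unmount/patch-recursive-delete.sh -o patch-recursive-delete.sh ; chmod +x patch-recursive-delete.sh ; sudo ./patch-recursive-delete.sh https"
done
```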
Note: New login nodes cannot be patched at launch because they do not support OnNodeStart actions. Every new login node must be patched manually, following the procedure described earlier in the Patching running nodes section.
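For example, a minimal sketch of patching login nodes over SSH (the hostnames are placeholders; assumes SSH access and sudo on each node):

```bash
# LOGIN_NODES values are placeholders; replace with your login node addresses.
LOGIN_NODES="login-node-1 login-node-2"
for host in $LOGIN_NODES; do
  ssh "$host" 'curl https://us-east-1-aws-parallelcluster.s3.amazonaws.com/patches/avoid-recursive-delete-on-unmount/patch-recursive-delete.sh -o patch-recursive-delete.sh \
    && chmod +x patch-recursive-delete.sh \
    && sudo ./patch-recursive-delete.sh https'
done
```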