
Self-patch a Cluster Used for Submitting Multi-node Parallel Jobs through AWS Batch


A small number of users have been affected by an issue when submitting multi-node parallel jobs with AWS Batch as the job scheduler. This issue can be fixed in one of two ways.

The default recommended course of action is to upgrade to AWS ParallelCluster 2.5.0 or later by running the standard command for updating the software: pip3 install aws-parallelcluster --upgrade --user. If a software upgrade is not feasible, the issue can also be resolved by self-patching with the script below, authored by the AWS ParallelCluster team.
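For reference, a typical upgrade looks like the following (a minimal sketch, assuming the pcluster CLI was originally installed with pip3 for the current user):

pip3 install aws-parallelcluster --upgrade --user
pcluster version   # should report 2.5.0 or later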

To self-patch, copy the script below and paste it into a new .sh file of your choosing on your local machine. The script can then be run for each combination of cluster and region affected by this issue, as follows: <filename>.sh -r <affected region> -c <affected cluster name>

For example, if I saved the script as batch_patch.sh, my cluster was named default, and it was located in us-east-1, I would apply the patch by running this command from a command line on my local machine: ./batch_patch.sh -r us-east-1 -c default
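If the file is not yet executable, a typical invocation looks like this (batch_patch.sh is just the example filename from above):

chmod +x batch_patch.sh
./batch_patch.sh -r us-east-1 -c default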

Please note the following requirements for successfully running this patching script (a quick way to check them is shown after the list):

  • The AWS CLI (see the AWS CLI installation instructions in the AWS documentation)
  • The zip command
  • The tar command
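Before running the patch, you can confirm the tools and your AWS credentials are available with a quick check like the following (a minimal sketch; adjust for your platform):

aws --version
command -v zip tar
aws sts get-caller-identity   # confirms that AWS credentials are configured

The full patch script follows: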
#!/bin/bash
set -e

usage()
{
    echo "usage: script-name.sh -r region -c cluster-name"
}


CLUSTER_NAME=
REGION=

while [ "$1" != "" ]; do
    case $1 in
        -r | --region )         shift
                                REGION=$1
                                ;;
        -c | --cluster-name )   shift
                                CLUSTER_NAME=$1
                                ;;
        -h | --help )           usage
                                exit
                                ;;
        * )                     usage
                                exit 1
    esac
    shift
done

# Both the region and the cluster name are required
if [ -z "${CLUSTER_NAME}" ] || [ -z "${REGION}" ]; then
    usage
    exit 1
fi

# Derive the cluster's CloudFormation stack name and read the resources S3 bucket from the stack outputs
STACK_NAME="parallelcluster-${CLUSTER_NAME}"
S3_BUCKET=$(aws cloudformation describe-stacks --stack-name "${STACK_NAME}" --query "Stacks[0].Outputs[?OutputKey=='ResourcesS3Bucket'].OutputValue" --output text --region "${REGION}")
# Temporary working directory for the patched artifacts
WD="/tmp/pcluster-awsbatch-fix"

echo "Dowloading patched artifacts"
curl -s -L https://github.com/aws/aws-parallelcluster/tarball/b781987 > /tmp/pcluster-awsbatch-fix.tar.gz
mkdir -p "${WD}" && tar -xf /tmp/pcluster-awsbatch-fix.tar.gz -C "${WD}"
cd "${WD}/aws-aws-parallelcluster-b781987/cli/pcluster/resources/batch/docker/" && zip --quiet -r "${WD}/artifacts.zip" ./*

echo "Uploading patched artifacts to ${S3_BUCKET}"
aws s3 cp "${WD}/artifacts.zip" "s3://${S3_BUCKET}/docker/artifacts.zip" --region "${REGION}"

# Kick off the cluster's CodeBuild project so the AWS Batch Docker images are rebuilt from the patched artifacts
CODEBUILD_PROJECT="${STACK_NAME}-build-docker-images-project"
BUILD_ID=$(aws codebuild start-build --project-name "${CODEBUILD_PROJECT}" --query "build.id" --output text --region "${REGION}")

echo "Applying patch to docker image..."
BUILD_STATUS=
while [ "${BUILD_STATUS}" != "SUCCEEDED" ] && [ "${BUILD_STATUS}" != "FAILED" ]; do
    sleep 20
    BUILD_STATUS=$(aws codebuild batch-get-builds --ids ${BUILD_ID} --query "builds[0].buildStatus" --output text --region ${REGION})
    echo "Build status is ${BUILD_STATUS}"
done 

echo "Patch applied successfully!"