Feature or Documentation Request - Continuous Deployment (eg. Blue/Green ) #6382

Open
elduds opened this issue Aug 6, 2024 · 0 comments

One major challenge I have had is the effort and roadblocks involved in releasing configuration changes and other improvements to active clusters. Often it is impossible to make a required change because the compute fleet cannot be stopped or drained while a job is running, and jobs often run for days or weeks at a time.

Many cluster update operations have an Update Policy that either requires labour-intensive manual activities or flat-out disallows the change, the implicit instruction being to just build a new cluster. Ideally ParallelCluster would support frequent incremental changes, for example changing Tags.

Some other examples of challenges with making changes:

  1. Feature Request: Limiting Scope of SharedStorage Updates #6357 — updating the SharedStorage cluster configuration detaches and recreates all of the shared storage volumes specified in the updated configuration
  2. cluster update fails in 3.10.0, 3.9.3 #6339 — an in-place upgrade causes a rollback
  3. Question: Slurm Accounting migration between ParallelCluster versions #6214 — the accounting database migration fails

Also not covered are operational updates made outside the ParallelCluster configuration files, such as CustomActions scripts and CustomSlurmSettings include files.

While all of these could be addressed individually, a sledgehammer approach that should cover all contingencies, even those we've not yet encountered, would be explicit support for, or at least guidance on, blue/green deployments.

The benefits would be:

  1. ability to release much more frequent incremental improvements
  2. greater support for a fully automated Continuous Deployment pipeline
  3. ability to run automated smoke tests and integration tests such as submitting slurm test cases to validate the stack end-to-end
  4. ability to run other manual UAT activities if required
  5. ensure that the updated configuration works correctly before the users encounter errors
  6. non-disruptively support all create and update operations, not just those explicitly supported by pcluster update-cluster
  7. empowered to make arbitrary configuration changes without requiring manual actions such as stopping or draining compute fleets
  8. release/support a new version of your HPC software (e.g. we link /ansys_inc/latest to /ansys_inc/v245 or whatever) and be confident the entire stack is working
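To make the idea concrete, here is a minimal sketch of the bookkeeping a blue/green cycle would involve. Everything here is hypothetical illustration — the project name, colour scheme, and the `plan_rollout` helper are made up; the `pcluster create-cluster` / `pcluster delete-cluster` subcommands referenced in the step descriptions are the existing v3 CLI verbs.

```python
# Sketch of blue/green cluster naming and cut-over bookkeeping (hypothetical).
# A real pipeline would shell out to the pcluster CLI and a Slurm smoke-test
# job at each step; this only models the ordering and naming.

COLORS = ("blue", "green")


def idle_color(active: str) -> str:
    """Return the colour slot available for the next deployment."""
    if active not in COLORS:
        raise ValueError(f"unknown colour: {active}")
    return COLORS[1 - COLORS.index(active)]


def cluster_name(project: str, color: str) -> str:
    """Cluster names encode project and colour, e.g. cfd-green."""
    return f"{project}-{color}"


def plan_rollout(project: str, active: str) -> list[str]:
    """Ordered steps for one blue/green cycle (descriptive strings, not executed)."""
    new = idle_color(active)
    return [
        f"pcluster create-cluster --cluster-name {cluster_name(project, new)} "
        f"--cluster-configuration {new}.yaml",
        f"run smoke/integration tests: submit Slurm test jobs on {cluster_name(project, new)}",
        f"cut over users and DNS from {cluster_name(project, active)} "
        f"to {cluster_name(project, new)}",
        f"drain, then pcluster delete-cluster --cluster-name {cluster_name(project, active)}",
    ]


for step in plan_rollout("cfd", "blue"):
    print(step)
```

The key design point is that both colours mount the same project-level resources (EFS, S3, IAM), so the new cluster can be fully validated before any user is moved.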

Different use cases will obviously have a different mix of supporting resources, such as users and persistent and transient data. For example, we keep our CFD application binaries and support scripts on an EFS filesystem that needs to be defined outside the individual cluster's SharedStorage configuration.
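For reference, ParallelCluster can already mount an externally managed EFS filesystem by id, which is what makes it shareable across blue and green clusters. A sketch of the relevant SharedStorage fragment, with a placeholder FileSystemId and mount path:

```yaml
# Hypothetical fragment: mount a project-level EFS filesystem that is
# created and owned outside the cluster's own lifecycle.
SharedStorage:
  - MountDir: /apps                 # placeholder mount path
    Name: project-apps
    StorageType: Efs
    EfsSettings:
      FileSystemId: fs-0123456789abcdef0   # placeholder id of the external EFS
```

Because the filesystem outlives any one cluster, deleting the blue cluster after cut-over does not touch the shared binaries.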

Our model is a "project" at the top level, with resources including:

  • a persistent data store (EFS binaries)
  • s3 buckets for case, result and post-processed data
  • IAM policies that are referenced by IAM Identity Center for our users
  • IAM policies for attachment to EC2 resources for all cluster resources in the project
  • additional security groups for attachment to EC2 resources to allow access to/from project resources like s3 buckets

One level down from the project is the ParallelCluster concept of a cluster, which includes the familiar head node, compute queues, login nodes, and Lustre filesystems for in-flight data with cluster- and case-specific DRA prefixes out to the "project" s3 buckets.

Blue/green deployments for clusters within a project would be a powerful capability.

I have also opened a case with Enterprise Support #172291072100573
