Feature or Documentation Request - Continuous Deployment (eg. Blue/Green ) #6382

Open
elduds opened this issue Aug 6, 2024 · 0 comments

One major challenge I have had is the effort and roadblocks involved in releasing configuration changes and other improvements to active clusters. Often it is impossible to make a required change because the compute fleet cannot be stopped or drained while a job is running, and jobs often run for days or weeks at a time.

Many cluster update operations have an Update Policy that either requires labour-intensive manual activities or flat-out disallows the change, the implicit instruction being to just build a new cluster. Ideally ParallelCluster would support frequent incremental changes, for example changing Tags.

Some other examples of challenges with making changes:

  1. Feature Request: Limiting Scope of SharedStorage Updates #6357 — updating the SharedStorage cluster configuration detaches and recreates all of the shared storage volumes specified in the updated configuration
  2. cluster update fails in 3.10.0, 3.9.3 #6339 — an in-place upgrade causes a rollback
  3. Question: Slurm Accounting migration between ParallelCluster versions #6214 — the accounting database migration fails

Also not covered are operational updates made outside the ParallelCluster configuration files, such as CustomActions scripts and CustomSlurmSettings include files.

While all of these could be addressed individually, a sledgehammer approach that should cover all contingencies, even those we've not yet encountered, would be explicit support for, or at least guidance on, blue/green deployments.

The benefits would be:

  1. ability to release much more frequent incremental improvements
  2. greater support for a fully automated Continuous Deployment pipeline
  3. ability to run automated smoke tests and integration tests such as submitting slurm test cases to validate the stack end-to-end
  4. ability to run other manual UAT activities if required
  5. ensure that the updated configuration works correctly before the users encounter errors
  6. non-disruptively support all create and update operations, not just those explicitly supported by pcluster update-cluster
  7. empowered to make arbitrary configuration changes without requiring manual actions such as stopping or draining compute fleets
  8. release/support a new version of your HPC software (e.g. we link /ansys_inc/latest to /ansys_inc/v245 or whatever) and be confident the entire stack is working
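To make the idea concrete, here is a minimal sketch of the bookkeeping a blue/green cycle would involve. Everything here is hypothetical illustration — the project name, colour scheme, and the `plan_rollout` helper are made up; the `pcluster create-cluster` / `pcluster delete-cluster` subcommands referenced in the step descriptions are the existing v3 CLI verbs.

```python
# Sketch of blue/green cluster naming and cut-over bookkeeping (hypothetical).
# A real pipeline would shell out to the pcluster CLI and a Slurm smoke-test
# job at each step; this only models the ordering and naming.

COLORS = ("blue", "green")


def idle_color(active: str) -> str:
    """Return the colour slot available for the next deployment."""
    if active not in COLORS:
        raise ValueError(f"unknown colour: {active}")
    return COLORS[1 - COLORS.index(active)]


def cluster_name(project: str, color: str) -> str:
    """Cluster names encode project and colour, e.g. cfd-green."""
    return f"{project}-{color}"


def plan_rollout(project: str, active: str) -> list[str]:
    """Ordered steps for one blue/green cycle (descriptive strings, not executed)."""
    new = idle_color(active)
    return [
        f"pcluster create-cluster --cluster-name {cluster_name(project, new)} "
        f"--cluster-configuration {new}.yaml",
        f"run smoke/integration tests: submit Slurm test jobs on {cluster_name(project, new)}",
        f"cut over users and DNS from {cluster_name(project, active)} "
        f"to {cluster_name(project, new)}",
        f"drain, then pcluster delete-cluster --cluster-name {cluster_name(project, active)}",
    ]


for step in plan_rollout("cfd", "blue"):
    print(step)
```

The key design point is that both colours mount the same project-level resources (EFS, S3, IAM), so the new cluster can be fully validated before any user is moved.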

Different use cases will obviously have a different mix of supporting resources, such as users and persistent and transient data. For example, we keep our CFD application binaries and support scripts on an EFS filesystem that needs to be defined outside the individual cluster's SharedStorage configuration.
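For reference, ParallelCluster can already mount an externally managed EFS filesystem by id, which is what makes it shareable across blue and green clusters. A sketch of the relevant SharedStorage fragment, with a placeholder FileSystemId and mount path:

```yaml
# Hypothetical fragment: mount a project-level EFS filesystem that is
# created and owned outside the cluster's own lifecycle.
SharedStorage:
  - MountDir: /apps                 # placeholder mount path
    Name: project-apps
    StorageType: Efs
    EfsSettings:
      FileSystemId: fs-0123456789abcdef0   # placeholder id of the external EFS
```

Because the filesystem outlives any one cluster, deleting the blue cluster after cut-over does not touch the shared binaries.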

Our model is a "project" at the top level, with resources including:

  • a persistent data store (EFS binaries)
  • s3 buckets for case, result and post-processed data
  • IAM policies that are referenced by IAM Identity Center for our users
  • IAM policies for attachment to EC2 resources for all cluster resources in the project
  • additional security groups for attachment to EC2 resources to allow access to/from project resources like s3 buckets

One level down from the project is the ParallelCluster concept of a cluster, which includes the familiar head node, compute queues, login nodes, and Lustre filesystems for in-flight data with cluster- and case-specific DRA prefixes out to the "project" s3 buckets.

Blue/green deployments for clusters within a project would be a powerful capability.

I have also opened a case with Enterprise Support #172291072100573
