One major challenge I have had is the effort and roadblocks involved in releasing configuration changes and other improvements to active clusters. Oftentimes it is impossible to make a required change to a cluster because the compute fleet cannot be stopped or drained due to running jobs, which often run for days or weeks at a time.
Many cluster update operations have an Update Policy that either requires labour-intensive activities or flat-out disallows the change, the implicit instruction being to just build a new cluster. Ideally ParallelCluster would support frequent incremental changes, for example changing Tags.
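As a concrete illustration of hitting that wall, the CLI's dry-run mode can be used to check whether a proposed change is blocked before committing to it. This is a minimal sketch, assuming ParallelCluster 3.x; the cluster name and config file are hypothetical, and the exact validation messages vary by release.

```python
import subprocess

# Hypothetical names: substitute your own cluster and the edited config
# (e.g. one that only changes Tags).
CLUSTER = "cfd-prod"
NEW_CONFIG = "cluster-config-new-tags.yaml"

# Ask ParallelCluster to validate the update without applying anything.
# If a changed setting has a restrictive Update Policy, the dry run reports
# which changes need a stopped compute fleet or are not supported at all.
result = subprocess.run(
    [
        "pcluster", "update-cluster",
        "--cluster-name", CLUSTER,
        "--cluster-configuration", NEW_CONFIG,
        "--dryrun", "true",
    ],
    capture_output=True,
    text=True,
)

print(result.stdout or result.stderr)
```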
Some other examples of challenges with making changes: not covered at all are operational updates made outside of the ParallelCluster configuration files, such as CustomActions scripts and CustomSlurmSettings include files.
While all of these could be addressed individually, a sledgehammer approach that would cover all contingencies, even those we have not yet encountered, would be explicit support for, or at least guidance on, blue/green deployments.
The benefits would be:
ability to release much more frequent incremental improvements
greater support for a fully automated Continuous Deployment pipeline
ability to run automated smoke tests and integration tests, such as submitting Slurm test cases to validate the stack end-to-end (see the sketch after this list)
ability to run other manual UAT activities if required
ability to ensure that the updated configuration works correctly before users encounter errors
ability to support all create and update operations non-disruptively, not just those explicitly supported by pcluster update-cluster
ability to make arbitrary configuration changes without manual actions such as stopping or draining compute fleets
ability to release and support a new version of your HPC software (e.g. we link /ansys_inc/latest to /ansys_inc/v245) and be confident the entire stack is working
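To make the blue/green idea concrete, here is a minimal sketch of how such a pipeline could be scripted against today's CLI, assuming ParallelCluster 3.x. The cluster names and config path are placeholders, the smoke-test hook is a stub for site-specific Slurm test jobs, and the cut-over step is only outlined because it depends on how users reach the cluster.

```python
import json
import subprocess
import time

# Hypothetical names: the "green" cluster is built from the updated config
# while the "blue" cluster keeps serving its long-running jobs untouched.
BLUE = "cfd-prod-blue"
GREEN = "cfd-prod-green"
GREEN_CONFIG = "cluster-config-green.yaml"


def pcluster(*args):
    """Run a pcluster CLI command and return its parsed JSON output."""
    out = subprocess.run(["pcluster", *args], capture_output=True, text=True, check=True)
    return json.loads(out.stdout)


def wait_for_create(name, poll_seconds=60):
    """Poll describe-cluster until the new cluster finishes building."""
    while True:
        status = pcluster("describe-cluster", "--cluster-name", name)["clusterStatus"]
        if status == "CREATE_COMPLETE":
            return
        if status.endswith("FAILED"):
            raise RuntimeError(f"{name} ended in {status}")
        time.sleep(poll_seconds)


def run_smoke_tests(name):
    """Placeholder: submit Slurm test jobs against the green cluster
    (e.g. sbatch over SSH/SSM to its head node) and fail loudly if they
    do not complete successfully."""
    raise NotImplementedError("site-specific smoke and integration tests go here")


# 1. Stand up the green cluster alongside blue.
pcluster("create-cluster", "--cluster-name", GREEN,
         "--cluster-configuration", GREEN_CONFIG)
wait_for_create(GREEN)

# 2. Validate it end-to-end before any user touches it.
run_smoke_tests(GREEN)

# 3. Point users at green, then retire blue once its jobs have drained.
#    (The cut-over mechanism, e.g. login aliases or DNS, is site-specific.)
# pcluster("delete-cluster", "--cluster-name", BLUE)
```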
Different use cases will obviously have a different mix of supporting resources, such as users and persistent and transient data. For example, we keep our CFD application binaries and support scripts on an EFS filesystem that would need to be defined outside the individual cluster's SharedStorage configuration (see the sketch after the list below).
Our model is a "project" at the top level, with resources including:
a persistent data store (EFS binaries)
S3 buckets for case, result, and post-processed data
IAM policies that are referenced by our IAM Identity Center for our users
IAM policies for attachment to EC2 resources for all cluster resources in the project
additional security groups for attachment to EC2 resources to allow access to/from project resources like S3 buckets
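Because those resources live at the project level, both a blue and a green cluster can attach to them. A minimal sketch of what that looks like for the EFS binaries, assuming a hypothetical filesystem ID and mount path, with the SharedStorage fragment generated from Python so that both clusters' configs stay in sync:

```python
import yaml  # PyYAML

# Hypothetical project-level resource created outside any one cluster.
PROJECT_EFS_ID = "fs-0123456789abcdef0"  # persistent EFS holding application binaries
MOUNT_DIR = "/ansys_inc"


def shared_storage_fragment():
    """SharedStorage entry that mounts the existing project EFS, so a blue
    and a green cluster see the same binaries without either owning them."""
    return {
        "SharedStorage": [
            {
                "MountDir": MOUNT_DIR,
                "Name": "project-efs",
                "StorageType": "Efs",
                "EfsSettings": {"FileSystemId": PROJECT_EFS_ID},
            }
        ]
    }


# Merge this fragment into each cluster's configuration file.
print(yaml.safe_dump(shared_storage_fragment(), sort_keys=False))
```

Because neither cluster owns the filesystem, deleting the blue cluster after cut-over leaves the binaries and the S3-backed project data untouched.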
One level down from the project is the ParallelCluster concept of a cluster, with the familiar components: a head node, compute queues, login nodes, and Lustre filesystems for in-flight data, with cluster- and case-specific DRA prefixes out to the "project" S3 buckets.
Blue/green deployments for clusters within a project would be a powerful capability
I have also opened a case with Enterprise Support #172291072100573