
terraform-provider-aws-parallelcluster fails on parallelcluster 3.11.0 with login nodes enabled #6489

Open
kondakovm opened this issue Oct 22, 2024 · 4 comments


@kondakovm

The terraform-provider-aws-parallelcluster fails while parsing the cluster status during creation against the ParallelCluster 3.11 API with login nodes enabled, resulting in the following error:

 Error: Error while waiting for cluster to finish updating.

   with module.parallelcluster_clusters.aws-parallelcluster_cluster.managed_configs["ParallelCluster"],
   on .terraform/modules/parallelcluster_clusters/modules/clusters/main.tf line 35, in resource "aws-parallelcluster_cluster" "managed_configs":
   35: resource "aws-parallelcluster_cluster" "managed_configs" {

 json: cannot unmarshal array into Go struct field _DescribeClusterResponseContent.loginNodes of type map[string]interface {}

Despite this error, the cluster was created and is fully operational, but Terraform cannot read or import it; both operations end with the same error.
This is most likely related to the transition from a single login node to multiple login nodes in a pool.

Additional info:
The deployment with login nodes works on ParallelCluster 3.10.1.
The deployment works on ParallelCluster 3.11.0 without login nodes enabled.

Required Info:

  • AWS ParallelCluster version 3.11.0
  • Full cluster configuration without any credentials or personal data:
 Region: eu-central-1
 CustomS3Bucket: parallelcluster-custom-bucket-name
 Image:
   Os: alinux2
 SharedStorage:
   - MountDir: /home
     Name: parallelcluster_shared
     StorageType: Ebs
     EbsSettings:
       VolumeType: gp3
       Size: 1000
       DeletionPolicy: Delete
 HeadNode:
   InstanceType: c5.large
   LocalStorage:
     RootVolume:
       Size: 100
       VolumeType: gp3
       DeleteOnTermination: true
   Networking:
     SubnetId: subnet-123456789123
   Iam:
     AdditionalIamPolicies:
       - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
   Ssh:
     KeyName: parallelcluster_ssh_key
 LoginNodes:
   Pools:
     - Name: login
       Count: 1
       InstanceType: t3.small
       Ssh:
         KeyName: parallelcluster_ssh_key
       Networking:
         SubnetIds:
           - subnet-123456789123
       Iam:
         AdditionalIamPolicies:
           - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
 Scheduling:
   Scheduler: slurm
   SlurmQueues:
     - Name: queue1
       CapacityType: SPOT
       Networking:
         SubnetIds:
           - subnet-123456789123
       Iam:
         AdditionalIamPolicies:
           - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
       ComputeResources:
         - InstanceType: c5.xlarge
           MinCount: 0
           MaxCount: 10
           Name: c5xlarge
   SlurmSettings:
     QueueUpdateStrategy: TERMINATE
     Dns:
       DisableManagedDns: true
       UseEc2Hostnames: true
@kondakovm added the 3.x label Oct 22, 2024
@hanwen-pcluste
Contributor

Thank you for reporting the issue. We will work on a fix.

@hanwen-pcluste
Contributor

The problem is solved in ParallelCluster 3.11.1. Please use the latest version.

@kondakovm
Author

Thank you for taking care of the issue. Unfortunately, I get the same error when deploying the config against the 3.11.1 API using the ParallelCluster provider:

Error: Error while waiting for cluster to finish updating.
  with module.parallelcluster_clusters.aws-parallelcluster_cluster.managed_configs["ParallelCluster"],
  on .terraform/modules/parallelcluster_clusters/modules/clusters/main.tf line 35, in resource "aws-parallelcluster_cluster" "managed_configs":
  35: resource "aws-parallelcluster_cluster" "managed_configs" {
json: cannot unmarshal array into Go struct field _DescribeClusterResponseContent.loginNodes of type map[string]interface {}

The cluster is created and fully functional, with no errors in the Lambda API logs, but Terraform cannot read or modify the login nodes' status.

@gmarciani
Contributor

Hi @kondakovm,

We are working on the issue with 3.11.1 and will give an update here once we have more info.

Thank you for reporting the issue.
