Flink Operator Loses Job Manager Contact during EKS upgrade #683

We have the Spotify operator v0.4.2 deployed. Our Flink pipelines are also rack-aware, meaning that each Flink cluster is deployed in only one AZ, mainly to reduce inter-AZ data-transfer costs. Although this has reduced our data-transfer costs, HA doesn't seem to work at all!

During an EKS node group upgrade, when a node that hosts one or more job managers for different clusters goes down, the Flink operator is not even aware that this has happened. The job manager goes down and the entire cluster is out.
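To make the blast radius concrete, here is a minimal sketch of how one could check which clusters share a node for their job managers before a node group upgrade. It parses data in the shape of `kubectl get pods -o json`. The label keys `cluster` and `component`, the value `jobmanager`, and the node and cluster names in the sample are assumptions for illustration and may not match the labels on your pods.

```python
import json
from collections import defaultdict

# Assumed label keys and value; adjust to the labels your job manager pods carry.
CLUSTER_LABEL = "cluster"
COMPONENT_LABEL = "component"
JOBMANAGER_VALUE = "jobmanager"


def jobmanagers_by_node(pod_list):
    """Map node name -> sorted list of Flink clusters whose job manager runs there."""
    by_node = defaultdict(list)
    for pod in pod_list.get("items", []):
        labels = pod.get("metadata", {}).get("labels") or {}
        if labels.get(COMPONENT_LABEL) != JOBMANAGER_VALUE:
            continue  # skip task managers and unrelated pods
        node = pod.get("spec", {}).get("nodeName")
        if node is None:
            continue  # pod has not been scheduled onto a node yet
        by_node[node].append(labels.get(CLUSTER_LABEL, "<unknown>"))
    return {node: sorted(clusters) for node, clusters in by_node.items()}


# Hypothetical sample in the shape of `kubectl get pods -o json`.
sample = json.loads("""
{
  "items": [
    {"metadata": {"labels": {"cluster": "pipeline-b", "component": "jobmanager"}},
     "spec": {"nodeName": "node-a"}},
    {"metadata": {"labels": {"cluster": "pipeline-a", "component": "jobmanager"}},
     "spec": {"nodeName": "node-a"}},
    {"metadata": {"labels": {"cluster": "pipeline-a", "component": "taskmanager"}},
     "spec": {"nodeName": "node-b"}}
  ]
}
""")

for node, clusters in jobmanagers_by_node(sample).items():
    print(f"{node}: draining this node takes down job managers for {clusters}")
```

In the sample, a single node hosts the job managers of two clusters, so draining that one node takes both clusters out at once.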

Can someone help us understand this? I'm unable to provide any logs, as the operator seems to think that the job manager is running as normal and no error is logged anywhere. All the job manager and task manager logs are gone too.
