We have the Spotify Flink operator v0.4.2 deployed. Our Flink pipelines are set up to be rack-aware, meaning that each Flink cluster is deployed in only one AZ, mainly to reduce inter-AZ data-transfer costs. Although this has reduced our data-transfer costs, HA doesn't seem to work at all!
During an EKS node group upgrade, when a node hosting one or more job managers for different clusters goes down, the Flink operator is not even aware of it. The job manager goes down and takes the entire cluster with it.
Can someone help us understand this? I'm unable to provide any logs because the operator seems to think the job manager is running normally, and no error is logged anywhere. All the job manager and task manager logs are gone too.