Flink Operator Loses Job Manager Contact during EKS upgrade #683

@guruguha

We have Spotify operator v0.4.2 deployed. Our Flink pipelines are also rack-aware, meaning that each Flink cluster is deployed in only one AZ, mainly to reduce inter-AZ data-transfer costs. Although this has reduced our data-transfer costs, HA doesn't seem to work at all!

During an EKS node group upgrade, when a node hosting one or more job managers for different clusters goes down, the Flink operator does not seem to be aware of it. The job manager goes down and the entire cluster is out.
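To make the scenario concrete, here is a rough sketch of the cross-check we would expect to matter: given the pod list and the set of nodes being replaced, find the clusters left without a running job manager. This is only an illustration, not operator code. The input shape mirrors the `items` list from `kubectl get pods -o json`; the `cluster` and `component` label names, the helper name, and all sample values are assumptions for the example.

```python
def clusters_without_jobmanager(pods, draining_nodes):
    """Return sorted names of Flink clusters with no healthy job manager pod.

    A job manager counts as healthy only if its pod phase is "Running"
    and it is not scheduled on a node that is being drained or replaced.
    """
    seen = set()      # every cluster that has at least one job manager pod
    healthy = set()   # clusters with at least one healthy job manager pod
    for pod in pods:
        labels = pod["metadata"].get("labels", {})
        # Assumed label convention: component=jobmanager, cluster=<name>.
        if labels.get("component") != "jobmanager":
            continue
        cluster = labels.get("cluster", "<unknown>")
        seen.add(cluster)
        running = pod["status"].get("phase") == "Running"
        on_live_node = pod["spec"].get("nodeName") not in draining_nodes
        if running and on_live_node:
            healthy.add(cluster)
    return sorted(seen - healthy)


# Hypothetical sample data: two clusters, one job manager each.
pods = [
    {"metadata": {"labels": {"cluster": "cluster-a", "component": "jobmanager"}},
     "spec": {"nodeName": "node-1"}, "status": {"phase": "Running"}},
    {"metadata": {"labels": {"cluster": "cluster-b", "component": "jobmanager"}},
     "spec": {"nodeName": "node-2"}, "status": {"phase": "Running"}},
    {"metadata": {"labels": {"cluster": "cluster-a", "component": "taskmanager"}},
     "spec": {"nodeName": "node-3"}, "status": {"phase": "Running"}},
]
draining_nodes = {"node-1"}  # node being replaced by the node group upgrade

affected = clusters_without_jobmanager(pods, draining_nodes)
print(affected)
```

With this sample data, `cluster-a` is the one that loses its only job manager when `node-1` goes away, which matches what we observe: the cluster is out even though nothing reports it.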

Can someone help us understand this? I'm unable to provide any logs: the operator seems to think that the job manager is running as normal, so no error is logged anywhere, and all the job manager and task manager logs are gone too.
