Updating FlinkDeployment interpreter to display error status, improving health interpreter #6073

Merged (1 commit) on Feb 5, 2025

Conversation

mszacillo
Contributor

@mszacillo mszacillo commented Jan 21, 2025

What type of PR is this?
/kind feature

What this PR does / why we need it:
After load testing FlinkDeployment failover (which looks good overall, although a couple of edge cases may still need to be addressed), I found that the interpreter does not account for one of the ephemeral states that a FlinkDeployment can transition through.

Occasionally on failover, the FlinkDeployment will transition from RECONCILING -> INITIALIZING -> CREATED, before finally ending on RUNNING. Additionally, we can make use of the status.error field to further improve the health interpretation.

In this PR I've added:

  1. The INITIALIZING state, as an ephemeral state that we should check during health interpretation.
  2. A check for status.error != nil. If the deployment has published an error, we treat it as healthy, since this indicates that the job failed due to user error. In the future, we may consider adding an ignore list of errors for which we would still like to fail over.
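
The two checks above can be sketched roughly as follows. This is an illustration in Go only, not the interpreter code changed by this PR: the type and function names (`flinkStatus`, `jobStatus`, `interpretHealth`) are hypothetical, the struct models only the fields mentioned here, and both the ordering of the checks and the handling of a missing status are assumptions.

```go
package main

// jobStatus and flinkStatus are hypothetical types that model only the
// FlinkDeployment status fields discussed in this PR.
type jobStatus struct {
	State string
}

type flinkStatus struct {
	JobStatus *jobStatus
	Error     string // stands in for status.error; empty means no published error
}

// ephemeralStates lists the job states a deployment may pass through on
// failover before it reaches RUNNING.
var ephemeralStates = map[string]bool{
	"RECONCILING":  true,
	"INITIALIZING": true,
	"CREATED":      true,
}

// interpretHealth reports whether the deployment should be treated as healthy.
func interpretHealth(s *flinkStatus) bool {
	if s == nil {
		// Assumption: without a status there is nothing to call healthy.
		return false
	}
	if s.Error != "" {
		// A published error is treated as healthy (user error).
		return true
	}
	if s.JobStatus == nil {
		return false
	}
	// Any state that is not ephemeral counts as healthy.
	return !ephemeralStates[s.JobStatus.State]
}
```

Under these assumptions, a deployment in INITIALIZING with no error is reported as not healthy, while the same deployment with a published error is reported as healthy.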

Which issue(s) this PR fixes:
Fixes #6023

Does this PR introduce a user-facing change?:

`karmada-controller-manager`: FlinkDeployment health interpreter improvements, adding status.error to reflected status
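
As a rough illustration of what "adding status.error to reflected status" means, the sketch below copies a fixed set of status fields, including `error`, into the reflected status. It is written in Go for illustration only and is not the code from this PR: `reflectStatus` and `reflectedFields` are hypothetical names, and the field list is illustrative, with only `error` taken from the description above.

```go
package main

// reflectedFields is an illustrative list of status keys to reflect.
// Only "error" comes from the PR description; "jobStatus" is an assumption.
var reflectedFields = []string{"jobStatus", "error"}

// reflectStatus returns a new map holding only the listed fields that are
// present in the observed status. Absent fields are skipped, not set to nil.
func reflectStatus(status map[string]interface{}) map[string]interface{} {
	out := make(map[string]interface{})
	for _, key := range reflectedFields {
		if value, ok := status[key]; ok {
			out[key] = value
		}
	}
	return out
}
```

With a list like this, a published error travels with the reflected status, while fields outside the list are dropped.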

@karmada-bot karmada-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 21, 2025
@karmada-bot karmada-bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jan 21, 2025
@codecov-commenter

codecov-commenter commented Jan 21, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 48.11%. Comparing base (820fd06) to head (f14e0f9).
Report is 18 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6073      +/-   ##
==========================================
- Coverage   48.33%   48.11%   -0.23%     
==========================================
  Files         666      668       +2     
  Lines       54858    55163     +305     
==========================================
+ Hits        26518    26541      +23     
- Misses      26616    26896     +280     
- Partials     1724     1726       +2     
Flag Coverage Δ
unittests 48.11% <ø> (-0.23%) ⬇️

@RainbowMango
Member

@yike21 @chaunceyjiang can you take a look?

@yike21
Member

yike21 commented Jan 22, 2025

> @yike21 @chaunceyjiang can you take a look?

Ok, I'll take a look at it ASAP :-)

@RainbowMango
Member

Here is the FlinkDeployment reference where you can find the definition of the status:
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/reference/

@mszacillo mszacillo force-pushed the updated-interpreter branch from f705b26 to e210ed8 Compare January 27, 2025 15:09
@mszacillo mszacillo force-pushed the updated-interpreter branch from e210ed8 to f14e0f9 Compare January 27, 2025 15:56
@yike21
Member

yike21 commented Jan 27, 2025

/lgtm
Thanks a lot! @mszacillo

@karmada-bot karmada-bot added the lgtm Indicates that a PR is ready to be merged. label Jan 27, 2025
@RainbowMango RainbowMango added this to the v1.13 milestone Feb 5, 2025
Member

@RainbowMango RainbowMango left a comment

Looks great!
/approve

@karmada-bot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: RainbowMango


@karmada-bot karmada-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 5, 2025
@karmada-bot karmada-bot merged commit d80b7d4 into karmada-io:master Feb 5, 2025
21 checks passed
Successfully merging this pull request may close these issues.

FlinkDeployment health interpreter does not account for ImagePullBackOff Error
5 participants