Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve conditions reporting #1605

Draft
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

lburgazzoli
Copy link
Contributor

Description

Important

This PR is mainly intended to gather feedback on some work and decide how to move forward.

This work was intended to implement RHOAIENG-18216, so to make reconciliation errors be visible also in the status sub-resource of the various CRs, however it ended up being a larger chunk of work. In case we decide to move forward with this approach, the PR must be split in smaller, incremental PRs.

I did a little bit of research on how conditions are reported by other operator and I ended up taking a lot of inspiration from how the Knative operators are handling with conditions, so:

  • have a top level, aggregating conditions
  • define a set of dependent conditions that are contributing to the state of the top level one
  • have an option to define the severity of a conditions

So now, for every API that is implemented with the reconciler framework, we now have at least two conditions:

  • Ready which report the overall status of the resource
  • ProvisioningSucceeded which reports any error eventually happening during the reconciliation

If any component has additional conditions, then it can declare them (if not all the conditions will be taken into account).
So as an example, in the ModelRegistry component I now have:

_, err := reconciler.ReconcilerFor(mgr, &componentApi.ModelRegistry{}).
    # ...
    WithConditions(
	status.ConditionDeploymentsAvailable,
	status.ConditionServerlessAvailable).
    Build(ctx)

Which results in having its status populated with the following conditions:

apiVersion: components.platform.opendatahub.io/v1alpha1
kind: ModelRegistry
metadata:
  name: default-modelregistry
spec:
  registriesNamespace: odh-model-registries
status:
  conditions:
  - lastTransitionTime: "2025-02-03T13:10:32Z"
    message: 0/1 deployments ready
    reason: DeploymentsNotReady
    status: "False"
    type: Ready
  - lastTransitionTime: "2025-02-03T13:10:32Z"
    message: 0/1 deployments ready
    observedGeneration: 1
    reason: DeploymentsNotReady
    status: "False"
    type: DeploymentsAvailable
  - lastTransitionTime: "2025-02-03T12:55:45Z"
    observedGeneration: 1
    status: "True"
    type: ProvisioningSucceeded
  - lastTransitionTime: "2025-02-03T12:55:32Z"
    status: "True"
    type: ServerlessAvailable

An the ready condition is reported a failing, because the DeploymentsAvailable conditions is not satisfied.

The DataScienceCluster has also been refactored to use the reconciler framework and as consequence behave the same, so it has a top level Ready condition, a ProvisioningSucceeded one, and a dedicated ComponentsReady condition that reports the overall status of the provisioned components (individual components conditions are still reported).

apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  # ...   
status:
  conditions:
  - lastTransitionTime: "2025-02-03T13:12:36Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2025-02-03T13:10:42Z"
    message: NoManagedComponents
    reason: NoManagedComponents
    severity: Info
    status: "True"
    type: ComponentsReady
  - lastTransitionTime: "2025-02-03T10:07:50Z"
    status: "True"
    type: ProvisioningSucceeded
  - lastTransitionTime: "2025-02-03T10:07:25Z"
    message: Component ManagementState is set to Removed
    reason: Removed
    status: "False"
    type: CodeFlareReady
  - ...

Note

in this case the Ready condition is reported as being satisfied, even if the ComponentsReady is not. This is because the severity is marked as Info (the default is Error and it is being represented by an empty value)

This severity field can be useful to report some specific states, so as an example, the Kserve component would report the Ready condition as true, even if ServerlessAvailable and ServiceMeshAvailable are not (in this case because the component is configured explicitly to not use serverless)

apiVersion: components.platform.opendatahub.io/v1alpha1
kind: Kserve
  name: default-kserve
spec:
  # ...
status:
  conditions:
  - lastTransitionTime: "2025-02-03T13:20:06Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2025-02-03T13:20:06Z"
    status: "True"
    type: DeploymentsAvailable
  - lastTransitionTime: "2025-02-03T13:19:39Z"
    observedGeneration: 1
    status: "True"
    type: ProvisioningSucceeded
  - lastTransitionTime: "2025-02-03T13:19:13Z"
    message: 'Serving management state is set to: Removed'
    reason: Removed
    severity: Info
    status: "False"
    type: ServerlessAvailable
  - lastTransitionTime: "2025-02-03T13:19:13Z"
    message: 'Serving management state is set to: Removed'
    reason: Removed
    severity: Info
    status: "False"
    type: ServiceMeshAvailable

Important

As part of this work, some other changes have been made:

  • make the DataScienceReconciler to use the reconciler frameworks
  • improve the gc action to offer more configurable options and usable also to remove non managed components
  • remove lastHeartBeatTime from DSC's conditions

How Has This Been Tested?

Screenshot or short clip

Merge criteria

  • You have read the contributors guide.
  • Commit messages are meaningful - have a clear and concise summary and detailed explanation of what was changed and why.
  • Pull Request contains a description of the solution, a link to the JIRA issue, and to any dependent or related Pull Request.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

Copy link

openshift-ci bot commented Feb 3, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Copy link

openshift-ci bot commented Feb 3, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from lburgazzoli. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@lburgazzoli lburgazzoli force-pushed the RHOAIENG-18216-available-condition branch from 71d7b21 to 128b55c Compare February 4, 2025 08:36
Copy link

codecov bot commented Feb 4, 2025

Codecov Report

Attention: Patch coverage is 28.46482% with 671 lines in your changes missing coverage. Please review.

Project coverage is 21.79%. Comparing base (fe3c571) to head (daf6d7e).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
pkg/services/gc/gc.go 0.00% 74 Missing ⚠️
...cecluster/datasciencecluster_controller_actions.go 0.00% 73 Missing ⚠️
pkg/controller/conditions/conditions.go 71.34% 42 Missing and 9 partials ⚠️
...cecluster/datasciencecluster_controller_support.go 0.00% 49 Missing ⚠️
...ers/components/kserve/kserve_controller_actions.go 0.00% 44 Missing ⚠️
pkg/controller/reconciler/reconciler.go 56.38% 37 Missing and 4 partials ⚠️
pkg/controller/conditions/conditions_support.go 63.63% 20 Missing and 8 partials ⚠️
.../modelregistry/modelregistry_controller_actions.go 0.00% 26 Missing ⚠️
...pelines/datasciencepipelines_controller_actions.go 0.00% 24 Missing ⚠️
...atasciencecluster/datasciencecluster_controller.go 0.00% 23 Missing ⚠️
... and 35 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1605      +/-   ##
==========================================
+ Coverage   19.98%   21.79%   +1.81%     
==========================================
  Files         162      166       +4     
  Lines       11010    11397     +387     
==========================================
+ Hits         2200     2484     +284     
- Misses       8573     8648      +75     
- Partials      237      265      +28     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Todo
Development

Successfully merging this pull request may close these issues.

1 participant