Aggregate NodeConfig status conditions #1918

Merged

rzetelskik
Member

@rzetelskik rzetelskik commented May 13, 2024

Description of your changes: Currently, the node setup controller only sets conditions of the form Node%sAvailable and the like. They are impractical and can't easily be used to query the NodeConfig status. This PR aggregates NodeConfig conditions into the generic Available, Progressing and Degraded conditions by:

  • Replacing the proprietary NodeConfigCondition with metav1.Condition in the NodeConfig API. As a result, the helpers for NodeConfigCondition are no longer needed, so all related code is removed. Edit: we can't do this in the current API version because metav1.Condition has tighter validation. Left TODOs about making this change in the next API version.
  • Adding conditions for NodeConfig controller
  • Extending NodeConfig controller with conditions aggregation. Node setup controller still aggregates conditions per-node, but NodeConfig controller performs the top level aggregation. In case any of the nodes' aggregated conditions are missing, it assumes Available=False and Progressing=True for the given node.
  • Deprecating the existing condition of type NodeConfigReconciledConditionType. The standard workload conditions are now also used to calculate the deprecated condition.
  • Adjusting the existing E2E tests to these changes.
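
The top-level aggregation described in the list above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from this PR: `Condition` is a simplified stand-in for `metav1.Condition`, and `aggregateNodeConditions` and `findStatus` are hypothetical names. The description only specifies the defaults for a missing Available (False) and Progressing (True); treating a missing Degraded as "not degraded" is an assumption of this sketch.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for metav1.Condition.
type Condition struct {
	Type   string
	Status string // "True", "False" or "Unknown"
}

// findStatus returns the status of the condition with the given type
// and reports whether such a condition exists.
func findStatus(conds []Condition, condType string) (string, bool) {
	for _, c := range conds {
		if c.Type == condType {
			return c.Status, true
		}
	}
	return "", false
}

func boolToStatus(b bool) string {
	if b {
		return "True"
	}
	return "False"
}

// aggregateNodeConditions folds the per-node aggregated conditions into
// top-level Available, Progressing and Degraded conditions.
func aggregateNodeConditions(nodes []string, perNode map[string][]Condition) []Condition {
	available, progressing, degraded := true, false, false
	for _, node := range nodes {
		conds := perNode[node]
		// A missing Available condition is assumed to mean Available=False.
		if s, ok := findStatus(conds, "Available"); !ok || s != "True" {
			available = false
		}
		// A missing Progressing condition is assumed to mean Progressing=True.
		if s, ok := findStatus(conds, "Progressing"); !ok || s == "True" {
			progressing = true
		}
		// Assumption of this sketch: a missing Degraded condition is not degraded.
		if s, ok := findStatus(conds, "Degraded"); ok && s == "True" {
			degraded = true
		}
	}
	return []Condition{
		{Type: "Available", Status: boolToStatus(available)},
		{Type: "Progressing", Status: boolToStatus(progressing)},
		{Type: "Degraded", Status: boolToStatus(degraded)},
	}
}

func main() {
	perNode := map[string][]Condition{
		"node-a": {
			{Type: "Available", Status: "True"},
			{Type: "Progressing", Status: "False"},
			{Type: "Degraded", Status: "False"},
		},
		// "node-b" has not reported any aggregated conditions yet.
	}
	for _, c := range aggregateNodeConditions([]string{"node-a", "node-b"}, perNode) {
		fmt.Printf("%s=%s\n", c.Type, c.Status)
	}
}
```

In this example the node without reported conditions pulls the top-level result to Available=False and Progressing=True, which matches the pessimistic default described above.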

Which issue is resolved by this Pull Request:
Resolves #1557

/kind feature
/priority important-soon

/cc

Contributor

@rzetelskik: GitHub didn't allow me to request PR reviews from the following users: rzetelskik.

Note that only scylladb members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Description of your changes: WIP

Which issue is resolved by this Pull Request:
Resolves #1557

/kind feature
/priority important-soon

/cc

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@scylla-operator-bot scylla-operator-bot bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 13, 2024
@rzetelskik rzetelskik force-pushed the nodeconfig-conditions branch 3 times, most recently from 5269fcb to b7b4c9b Compare May 13, 2024 22:32
@scylla-operator-bot scylla-operator-bot bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 13, 2024
@rzetelskik rzetelskik force-pushed the nodeconfig-conditions branch 5 times, most recently from 37352a5 to c0f8abe Compare May 14, 2024 16:11
@rzetelskik
Member Author

/test all

@rzetelskik rzetelskik force-pushed the nodeconfig-conditions branch 2 times, most recently from 7c30291 to f75cb9b Compare May 14, 2024 20:17
@rzetelskik
Member Author

/test images e2e-gke-serial

@rzetelskik
Member Author

/test all

@rzetelskik rzetelskik changed the title [WIP] Aggregate NodeConfig status conditions Aggregate NodeConfig status conditions May 15, 2024
@rzetelskik rzetelskik marked this pull request as ready for review May 15, 2024 11:46
@scylla-operator-bot scylla-operator-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 15, 2024
@rzetelskik
Member Author

@tnozicka you mention here #1557 (comment) that the tests should be extended with trying out broken raid/mount configurations, but after thinking this through I don't think they should be a part of the e2e suite - condition aggregation is covered by unit tests, and invalid configurations shouldn't go past validation. Any scenarios with corrupted devices require some hacks on the host, are prone to be flaky and are difficult to even come up with - e.g. a scenario working on my local setup wouldn't reproduce in CI.

@rzetelskik
Member Author

/cc zimnx

@tnozicka
Member

invalid configurations shouldn't go past validation

I don't think validation has a chance to assess this. Say something is already mounted at the target path - it's not something to be assessed on the API level.

Any scenarios with corrupted devices require some hacks on the host, are prone to be flaky and are difficult to even come up with - e.g. a scenario working on my local setup wouldn't reproduce in CI.

I think it should be possible to come up with at least one without node changes, like trying to mount something over /dev/zero or similar, but I don't feel strongly about it.

@scylla-operator-bot scylla-operator-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 2, 2024
@rzetelskik
Member Author

rebased and applied changes, @zimnx @tnozicka ptal

@rzetelskik
Member Author

/hold cancel

@scylla-operator-bot scylla-operator-bot bot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Sep 2, 2024
@scylla-operator-bot scylla-operator-bot bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 3, 2024
@rzetelskik
Member Author

rebased

Contributor

scylla-operator-bot bot commented Sep 3, 2024

@rzetelskik: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gke-release-script-latest | a20b428 | link | true | /test e2e-gke-release-script-latest |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@rzetelskik
Member Author

@rzetelskik: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gke-release-script-latest | a20b428 | link | true | /test e2e-gke-release-script-latest |
| ci/prow/e2e-gke-parallel | 77c0be6 | link | true | /test e2e-gke-parallel |

Full PR test history. Your PR dashboard.

known manager flake
#2061 (comment)

@rzetelskik rzetelskik changed the title Aggregate NodeConfig status conditions [WIP] Aggregate NodeConfig status conditions Sep 4, 2024
@scylla-operator-bot scylla-operator-bot bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 4, 2024
@rzetelskik rzetelskik changed the title [WIP] Aggregate NodeConfig status conditions Aggregate NodeConfig status conditions Sep 4, 2024
@scylla-operator-bot scylla-operator-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 4, 2024
Comment on lines -37 to -40
cond := scyllav1alpha1.NodeConfigCondition{
Type: scyllav1alpha1.NodeConfigReconciledConditionType,
ObservedGeneration: nc.Generation,
}
Member

conditions are not technically part of the API version / can be unknown or unset - but I agree, it should be kept, at least for a while, to retain compatibility and behaviour

Comment on lines -37 to -40
cond := scyllav1alpha1.NodeConfigCondition{
Type: scyllav1alpha1.NodeConfigReconciledConditionType,
ObservedGeneration: nc.Generation,
}
Member

I'd put this into syncDaemonSet though, something like

	// Sum the desired pod counts across all DaemonSets and track whether
	// every one of them is fully rolled out.
	desiredSum := int64(0)
	allReconciled := true
	for _, requiredDaemonSet := range requiredDaemonSets {
		if requiredDaemonSet == nil {
			continue
		}

		ds, _, err := resourceapply.ApplyDaemonSet(ctx, ncc.kubeClient.AppsV1(), ncc.daemonSetLister, ncc.eventRecorder, requiredDaemonSet, resourceapply.ApplyOptions{})
		if err != nil {
			return progressingConditions, fmt.Errorf("can't apply daemonset: %w", err)
		}

		desiredSum += int64(ds.Status.DesiredNumberScheduled)

		reconciled, err := controllerhelpers.IsDaemonSetRolledOut(ds)
		if err != nil {
			return progressingConditions, fmt.Errorf("can't determine if daemonset %q is reconciled: %w", naming.ObjRef(ds), err)
		}
		if !reconciled {
			allReconciled = false
		}
	}

	status.DesiredNodeSetupCount = pointer.Ptr(desiredSum)

	// Build the (deprecated) Reconciled condition from the rollout state.
	reconciledCondition := metav1.Condition{
		Type:               string(scyllav1alpha1.NodeConfigReconciledConditionType),
		ObservedGeneration: nc.Generation,
		Status:             metav1.ConditionUnknown,
	}

	if allReconciled {
		reconciledCondition.Status = metav1.ConditionTrue
		reconciledCondition.Reason = "FullyReconciledAndUp"
		reconciledCondition.Message = "All operands are reconciled and available."
	} else {
		reconciledCondition.Status = metav1.ConditionFalse
		reconciledCondition.Reason = "DaemonSetNotRolledOut"
		reconciledCondition.Message = "DaemonSet isn't reconciled and fully rolled out yet."
	}
	_ = apimeta.SetStatusCondition(statusConditions, reconciledCondition)
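
For readers unfamiliar with rollout checks, here is a minimal, self-contained sketch of what a check along the lines of IsDaemonSetRolledOut might look at. It is an assumption for illustration, not the actual controllerhelpers implementation: `daemonSetState`, `isRolledOut` and `summarize` are hypothetical names, and the struct only mirrors a few fields that exist on a Kubernetes DaemonSet (metadata.generation and the status counters).

```go
package main

import "fmt"

// daemonSetState is a hypothetical, reduced view of a DaemonSet:
// metadata.generation plus a few status fields.
type daemonSetState struct {
	Generation             int64
	ObservedGeneration     int64
	DesiredNumberScheduled int32
	UpdatedNumberScheduled int32
	NumberAvailable        int32
}

// isRolledOut reports whether the controller has observed the latest spec
// and all desired pods are both updated and available.
func isRolledOut(ds daemonSetState) bool {
	return ds.ObservedGeneration >= ds.Generation &&
		ds.UpdatedNumberScheduled == ds.DesiredNumberScheduled &&
		ds.NumberAvailable == ds.DesiredNumberScheduled
}

// summarize mirrors the loop in the suggestion above: it sums the desired
// pod counts and reports whether every non-nil DaemonSet is rolled out.
func summarize(sets []*daemonSetState) (desiredSum int64, allReconciled bool) {
	allReconciled = true
	for _, ds := range sets {
		if ds == nil {
			continue
		}
		desiredSum += int64(ds.DesiredNumberScheduled)
		if !isRolledOut(*ds) {
			allReconciled = false
		}
	}
	return desiredSum, allReconciled
}

func main() {
	sets := []*daemonSetState{
		{Generation: 2, ObservedGeneration: 2, DesiredNumberScheduled: 3, UpdatedNumberScheduled: 3, NumberAvailable: 3},
		nil,
		{Generation: 1, ObservedGeneration: 1, DesiredNumberScheduled: 2, UpdatedNumberScheduled: 1, NumberAvailable: 1},
	}
	sum, all := summarize(sets)
	fmt.Println("desired:", sum, "all rolled out:", all)
}
```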

pkg/controller/nodeconfig/sync.go
@rzetelskik
Member Author

@zimnx ping

Collaborator

@zimnx zimnx left a comment

lgtm
/assign tnozicka

@zimnx zimnx assigned tnozicka and unassigned rzetelskik Sep 6, 2024
@scylla-operator-bot scylla-operator-bot bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 6, 2024
Member

@tnozicka tnozicka left a comment

/approve
/lgtm

Comment on lines -37 to -40
cond := scyllav1alpha1.NodeConfigCondition{
Type: scyllav1alpha1.NodeConfigReconciledConditionType,
ObservedGeneration: nc.Generation,
}
Member

I hadn't seen the new projection at the time I wrote this. Looking at it, it changes the semantics of the condition, but I guess that's OK.
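
To make the "projection" concrete, here is one way the deprecated condition could be derived from the standard workload conditions. This is a sketch under assumptions, not the logic merged in this PR: the exact mapping is not spelled out in the conversation, `Condition` is a simplified stand-in for `metav1.Condition`, `projectReconciled` and `statusOf` are hypothetical names, and the condition type string "Reconciled" is assumed.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for metav1.Condition.
type Condition struct {
	Type   string
	Status string // "True", "False" or "Unknown"
	Reason string
}

// statusOf returns the status of the named condition, or "Unknown" if it is missing.
func statusOf(conds []Condition, condType string) string {
	for _, c := range conds {
		if c.Type == condType {
			return c.Status
		}
	}
	return "Unknown"
}

// projectReconciled derives the deprecated condition from Available,
// Progressing and Degraded. Assumed mapping: True only when the object is
// available, not progressing and not degraded; Unknown when any input is
// unknown or missing; False otherwise.
func projectReconciled(conds []Condition) Condition {
	available := statusOf(conds, "Available")
	progressing := statusOf(conds, "Progressing")
	degraded := statusOf(conds, "Degraded")

	out := Condition{Type: "Reconciled"} // assumed type name
	switch {
	case available == "True" && progressing == "False" && degraded == "False":
		out.Status = "True"
		out.Reason = "FullyReconciledAndUp"
	case available == "Unknown" || progressing == "Unknown" || degraded == "Unknown":
		out.Status = "Unknown"
	default:
		out.Status = "False"
	}
	return out
}

func main() {
	conds := []Condition{
		{Type: "Available", Status: "True"},
		{Type: "Progressing", Status: "True"},
		{Type: "Degraded", Status: "False"},
	}
	c := projectReconciled(conds)
	fmt.Printf("%s=%s\n", c.Type, c.Status)
}
```

Under such a mapping the deprecated condition would depend on node-level state rather than only on DaemonSet rollout, which is one way its semantics could change as noted in the comment above.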

@scylla-operator-bot scylla-operator-bot bot added the lgtm Indicates that a PR is ready to be merged. label Sep 6, 2024
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rzetelskik, tnozicka, zimnx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@scylla-operator-bot scylla-operator-bot bot merged commit 2917043 into scylladb:master Sep 6, 2024
12 checks passed
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/feature Categorizes issue or PR as related to a new feature. lgtm Indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NodeConfig needs to aggregate status conditions so they can be queried
3 participants