🐛 Fix HFC to execute updates #1793

Merged (1 commit) on Jul 5, 2024

iurygregory
Member

Currently, changes to HostFirmwareComponents do not trigger a firmware update on the host.
This commit fixes the issues we observed while testing with real hardware.

What this PR does / why we need it: metal3-io/metal3-docs#364

@metal3-io-bot metal3-io-bot requested review from honza and zaneb June 25, 2024 02:57
@metal3-io-bot metal3-io-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 25, 2024
@tuminoid
Member

We probably want to cherry-pick this to release-0.6 when done, if HFC doesn't work there?

@iurygregory
Member Author

@tuminoid correct

@tuminoid
Member

@tuminoid correct

Let's schedule the bot then to do the pick:
/cherry-pick release-0.6

@metal3-io-bot
Contributor

@tuminoid: once the present PR merges, I will cherry-pick it on top of release-0.6 in a new PR and assign it to you.

In response to this:

@tuminoid correct

Let's schedule the bot then to do the pick:
/cherry-pick release-0.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Currently any changes in HostFirmwareComponents won't trigger the
firmware update on it.
This commit aims to fix the issues we have observed while testing
with real hardware:
- BMH stuck in preparing and executing the same firmware update
  multiple times
- not being able to define the HostFirmwareComponent manually
  and have the update applied when the BMH is in preparing.

Signed-off-by: Iury Gregory Melo Ferreira <[email protected]>
@iurygregory
Member Author

Done, I've tested again in my setup before pushing the changes here.

@Rozzii Rozzii modified the milestones: BMO - v0.6.2, BMO - v0.7.0 Jul 3, 2024
Member

@zaneb zaneb left a comment

I left some comments, but clearly this is a huge improvement so it may be worth merging and following up on those afterwards.

@@ -1164,6 +1164,20 @@ func (r *BareMetalHostReconciler) actionPreparing(prov provisioner.Provisioner,
		return recordActionFailure(info, metal3api.PreparationError, provResult.ErrorMessage)
	}

	if hfcDirty && started {
		hfcStillDirty, err := r.saveHostFirmwareComponents(prov, info, hfc)
Member

This saves the new status when we begin the manual cleaning, but I think I'm right in saying clearHostProvisioningSettings() does not clear them? So if there's a failure in actually applying this change, we won't retry.

Member Author

If there is a failure, I would say we shouldn't retry: it can be a bad firmware image. Dell, for example, ships separate firmware for each model; if I use the firmware of an R750 on an R640, it complains and fails.

Member

Or it could be a dropped connection or a failed write.
We shouldn't report that we've done something if we haven't done it.
If the user provides the wrong firmware we should keep trying until they realise and stop doing that.

Member

Yeah, unless we invent a way for Ironic to say "this will never work, don't try again", the current trend is to retry.

Member Author

@iurygregory iurygregory Jul 9, 2024

	if provResult.ErrorMessage != "" {
		if bmhDirty {
			info.log.Info("handling cleaning error in controller")
			clearHostProvisioningSettings(info.host)
		}
		if hfcDirty {
			clearHostFirmwareComponentsUpdates(hfc)
		}
		return recordActionFailure(info, metal3api.PreparationError, provResult.ErrorMessage)
	}

Something like this? A new function, clearHostFirmwareComponentsUpdates, would be used here, since I don't think we should change clearHostProvisioningSettings.
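
A minimal sketch of the idea discussed in this thread, under stated assumptions: the types below are simplified stand-ins rather than the real metal3 API types, `updatesPending` is a hypothetical helper, and the body of `clearHostFirmwareComponentsUpdates` is only a guess at what such a function could do (the thread does not show an implementation). It illustrates why clearing the recorded status after a failed attempt would make the spec and status differ again, so the update is seen as pending and retried.

```go
package main

import (
	"fmt"
	"reflect"
)

// FirmwareUpdate is a simplified stand-in for the API type (assumption).
type FirmwareUpdate struct {
	Component string
	URL       string
}

// HostFirmwareComponents is a simplified stand-in that keeps only the
// requested (spec) and recorded (status) updates.
type HostFirmwareComponents struct {
	SpecUpdates   []FirmwareUpdate
	StatusUpdates []FirmwareUpdate
}

// clearHostFirmwareComponentsUpdates is hypothetical: it forgets the updates
// recorded in the status so that a failed attempt is not reported as done.
func clearHostFirmwareComponentsUpdates(hfc *HostFirmwareComponents) {
	hfc.StatusUpdates = nil
}

// updatesPending (hypothetical) reports whether the requested updates differ
// from the recorded ones.
func updatesPending(hfc *HostFirmwareComponents) bool {
	if len(hfc.SpecUpdates) == 0 && len(hfc.StatusUpdates) == 0 {
		return false
	}
	return !reflect.DeepEqual(hfc.SpecUpdates, hfc.StatusUpdates)
}

func main() {
	hfc := &HostFirmwareComponents{
		SpecUpdates: []FirmwareUpdate{{Component: "bmc", URL: "https://example.invalid/fw.bin"}},
	}

	// Status is saved when manual cleaning begins.
	hfc.StatusUpdates = append([]FirmwareUpdate(nil), hfc.SpecUpdates...)
	fmt.Println("pending after save:", updatesPending(hfc))

	// Cleaning failed: clear the status so the next reconcile retries.
	clearHostFirmwareComponentsUpdates(hfc)
	fmt.Println("pending after failure:", updatesPending(hfc))
}
```

Whether retrying indefinitely is desirable is exactly what the reviewers debate above; the sketch only shows the mechanism, not a policy.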

	}

	// Retrieve new information about the firmware components stored in ironic
	components, err := prov.GetFirmwareComponents()
Member

This function is only going to get called once, at the beginning of manual cleaning.
Aren't the Components expected to change in ironic only after manual cleaning is complete?

Member Author

Yeah, this is the problem I'm planning to fix in a separate PR; I'm still trying to figure out how.

Member

Zane is right. But I'm also wondering why the HFC controller does not update the components afterwards.

Member

Hmm, good point. I thought at one point we deleted that from the HFC controller, but indeed it's still there.

Member Author

Yup, I waited some hours to see if the HFC controller would attempt to update, but it didn't. My best guess is that I need to change the conditions under which updateHostFirmware is called in the Reconcile for HostFirmwareComponentsReconciler.

@@ -1851,15 +1904,6 @@ func (r *BareMetalHostReconciler) getHostFirmwareComponents(info *reconcileInfo)

	// Check if there are Updates in the Spec that are different than the Status
	if meta.IsStatusConditionTrue(hfc.Status.Conditions, string(metal3api.HostFirmwareComponentsChangeDetected)) {
Member

In theory we should check that the condition matches the current Generation (example) so we know the data is not out of date.
Not a huge deal though. (Bigger worry is actually that we miss that the Spec has been copied to the Status - which won't bump the generation - which would result in us doing the update again.)
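
A minimal sketch of the generation check suggested above, assuming a simplified local `Condition` type in place of `metav1.Condition` (which does carry an `ObservedGeneration` field). The helper name `isConditionCurrentAndTrue` and the condition type string are illustrative, not taken from the PR. The idea is to trust a condition only when it is True and was computed for the object's current generation.

```go
package main

import "fmt"

// Condition is a simplified stand-in for metav1.Condition (assumption).
type Condition struct {
	Type               string
	Status             string // "True", "False" or "Unknown"
	ObservedGeneration int64
}

// isConditionCurrentAndTrue (hypothetical) reports whether the named condition
// is True and was observed at the given generation of the object.
func isConditionCurrentAndTrue(conds []Condition, condType string, generation int64) bool {
	for _, c := range conds {
		if c.Type == condType {
			return c.Status == "True" && c.ObservedGeneration == generation
		}
	}
	return false
}

func main() {
	// The condition name here is a placeholder for illustration.
	conds := []Condition{{Type: "ChangeDetected", Status: "True", ObservedGeneration: 3}}

	// Condition computed for the current generation: safe to act on.
	fmt.Println(isConditionCurrentAndTrue(conds, "ChangeDetected", 3))

	// The spec was edited since the condition was written: treat it as stale.
	fmt.Println(isConditionCurrentAndTrue(conds, "ChangeDetected", 4))
}
```

This covers only the stale-condition case. It does not address the larger worry raised in the same comment, since copying the Spec to the Status does not bump the generation.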

Member Author

I will do some tests in the follow-up I'm working on.

@zaneb
Member

zaneb commented Jul 5, 2024

/approve
This is strictly better than what we have. Should probably hold off on the backport though.

@metal3-io-bot metal3-io-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 5, 2024
Member

@dtantsur dtantsur left a comment

I'm fine with a follow-up. Thank you Iury.

/lgtm

@metal3-io-bot metal3-io-bot added the lgtm Indicates that a PR is ready to be merged. label Jul 5, 2024
@metal3-io-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dtantsur, zaneb

@metal3-io-bot metal3-io-bot merged commit f8d5da6 into metal3-io:main Jul 5, 2024
17 checks passed
@metal3-io-bot
Contributor

@tuminoid: #1793 failed to apply on top of branch "release-0.6":

Applying: Fix HFC to execute updates
Using index info to reconstruct a base tree...
M	controllers/metal3.io/baremetalhost_controller.go
M	controllers/metal3.io/host_state_machine.go
M	controllers/metal3.io/hostfirmwarecomponents_controller.go
M	controllers/metal3.io/hostfirmwarecomponents_test.go
M	main.go
M	pkg/provisioner/fixture/fixture.go
M	pkg/provisioner/ironic/ironic.go
M	pkg/provisioner/ironic/provision_test.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/provisioner/ironic/provision_test.go
CONFLICT (content): Merge conflict in pkg/provisioner/ironic/provision_test.go
Auto-merging pkg/provisioner/ironic/ironic.go
Auto-merging pkg/provisioner/fixture/fixture.go
CONFLICT (content): Merge conflict in pkg/provisioner/fixture/fixture.go
Auto-merging main.go
CONFLICT (content): Merge conflict in main.go
Auto-merging controllers/metal3.io/hostfirmwarecomponents_test.go
CONFLICT (content): Merge conflict in controllers/metal3.io/hostfirmwarecomponents_test.go
Auto-merging controllers/metal3.io/hostfirmwarecomponents_controller.go
Auto-merging controllers/metal3.io/host_state_machine.go
Auto-merging controllers/metal3.io/baremetalhost_controller.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Fix HFC to execute updates
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
