jgallagher
Contributor

See #8970 for context. This doesn't fix all the ways we could get wedged if the switch zone is unhealthy after starting it up, but it does fix the case where it's so broken not even MGS is functional.

The changes here are pretty precarious (and took a bunch of retries to get something that worked!). I ran them on dublin while doing some testing for #8480, and was successfully able to start and stop the switch zone even if the sidecar was powered off and MGS was unresponsive.

I'll leave some comments on the changes below to point out details, but in general I really think #8970 warrants a bigger rework - maybe something along the lines of sled-agent-config-reconciler except limited in scope to managing and configuring the switch zone.

)
.await
.expect("Expected an infinite retry loop getting our switch ID");
let switch_slot = mgs_client
Contributor Author

We no longer retry this forever here, because our caller will retry if we return an error. (There are still spots later in this function that go into infinite retry loops, so it's possible for MGS to be healthy, for us to get a successful response here, and then for us to get stuck in one of those. But one fix at a time.)
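
To illustrate the shape of that change, here is a minimal sketch, not the actual sled-agent code. All names (`SetupError`, `ensure_switch_zone_once`, `caller_with_retries`) are hypothetical, and a slice of booleans stands in for whether MGS answers on each attempt. The inner step tries once and returns the error; the caller owns the retry policy.

```rust
/// Hypothetical error type standing in for "MGS did not answer".
#[derive(Debug, PartialEq)]
enum SetupError {
    MgsUnavailable,
}

/// Inner step: a single attempt with no internal retry loop.
/// `mgs_healthy` is a stand-in for the real MGS client call succeeding.
/// Returns a made-up switch slot number on success.
fn ensure_switch_zone_once(mgs_healthy: bool) -> Result<u8, SetupError> {
    if mgs_healthy {
        Ok(0)
    } else {
        Err(SetupError::MgsUnavailable)
    }
}

/// Caller: decides how often to retry. Each element of `health_by_attempt`
/// simulates MGS health on that attempt. Returns the slot and the number of
/// attempts used, or the last error once the attempts are exhausted.
fn caller_with_retries(health_by_attempt: &[bool]) -> Result<(u8, usize), SetupError> {
    let mut last_err = SetupError::MgsUnavailable;
    for (i, &healthy) in health_by_attempt.iter().enumerate() {
        match ensure_switch_zone_once(healthy) {
            Ok(slot) => return Ok((slot, i + 1)),
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}
```

Because the inner step can fail fast, the caller stays free to do other work between attempts, such as stopping the zone, instead of being parked inside a loop it cannot interrupt.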

// If we've given the switch an underlay address, we also need to inject
// SMF properties so that tfport uplinks can be created.
if let Some((ip, Some(rack_network_config))) = underlay_info {
self.ensure_switch_zone_uplinks_configured(ip, rack_network_config)
Contributor Author

This is where we used to get stuck trying forever; instead, we'll now do this uplink configuration either:

a) inside the async task we were already spawning if starting the switch zone for the first time
b) inside a new async task we now spawn if we're reconfiguring the switch zone because we just got our network config from RSS

}
let me = self.clone();
let (exit_tx, exit_rx) = oneshot::channel();
*worker = Some(Task {
Contributor Author

This is the new task we spawn in the "reconfiguring an existing switch zone" case.

// Then, if we have underlay info, go into a loop trying to configure
// our uplinks. As above, retry until we succeed or are told to stop.
if let Some(underlay_info) = underlay_info {
self.ensure_switch_zone_uplinks_configured_loop(
Contributor Author

This extends the task we were already spawning for the "start the switch zone" case.
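
For readers unfamiliar with the "retry until we succeed or are told to stop" pattern, here is a rough, self-contained sketch of it. It is not the real implementation: it uses a blocking `std::sync::mpsc` channel in place of the tokio `oneshot` channel seen in the diff, and `configure_uplinks_loop`, `LoopOutcome`, and the `try_configure` closure are hypothetical names. Treating a dropped sender as a stop request is also an assumption of this sketch.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// How the loop ended (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum LoopOutcome {
    /// Configuration succeeded after this many attempts.
    Configured { attempts: u32 },
    /// We were told to stop before configuration succeeded.
    Stopped,
}

/// Retry `try_configure` until it succeeds or a stop signal arrives on
/// `exit_rx`. Between attempts, wait up to `retry_delay`, waking early if
/// the stop signal is sent.
fn configure_uplinks_loop<F>(
    mut try_configure: F,
    exit_rx: &Receiver<()>,
    retry_delay: Duration,
) -> LoopOutcome
where
    F: FnMut() -> Result<(), String>,
{
    let mut attempts = 0;
    loop {
        attempts += 1;
        if try_configure().is_ok() {
            return LoopOutcome::Configured { attempts };
        }
        match exit_rx.recv_timeout(retry_delay) {
            // Explicit stop request, or the sender went away (assumed to
            // mean "stop" in this sketch).
            Ok(()) | Err(RecvTimeoutError::Disconnected) => {
                return LoopOutcome::Stopped;
            }
            // Delay elapsed with no stop request: try again.
            Err(RecvTimeoutError::Timeout) => continue,
        }
    }
}
```

The point of waiting on the exit channel rather than sleeping unconditionally is that a stop request interrupts the backoff immediately, so stopping the zone does not have to wait for an unresponsive MGS.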
