[reconfigurator] Should the rendezvous table RPW respect the "current target blueprint disabled" bit? #7393

jgallagher · 2025-01-23T21:27:41Z

The current target blueprint has a boolean indicating whether or not it's enabled. The blueprint_executor task respects this bit by returning before doing any work if enabled is false. The blueprint_rendezvous task currently does not inspect this bit at all. It's not immediately obvious what the rendezvous task should do here. The below is an attempt to summarize a discussion we had about this.

The rendezvous task currently has two duties:

If a debug dataset is in-service in the current target blueprint and is present in the latest inventory collection, ensure that a row for it exists in the debug_dataset rendezvous table.
If a debug dataset is expunged in the current target blueprint, ensure that if a row for that dataset exists in the debug_dataset table, that row is tombstoned.

Failing to respect enabled=false presumably doesn't affect the first duty. Since it's guarded by "is present in inventory", a dataset can only be added if the blueprint was enabled at some point in the past (otherwise, how could the dataset exist to be present in inventory), so the rendezvous RPW could have seen it and written it at some point when the blueprint (or one of its ancestors) was enabled.

Failing to respect enabled=false does affect the second duty. If a new target blueprint is set with enabled=false which expunges a debug dataset, the rendezvous RPW will tombstone the row, preventing downstream consumers (e.g., support bundles) from using that dataset, even though the blueprint_executor will do nothing, presumably leaving the debug dataset in service until the blueprint is enabled. This is inconsistent and confusing; it means the current target blueprint can be partially acted upon regardless of its enabled status. In the specific case of debug datasets, the effect seems pretty minimal (we'll stop choosing a dataset that presumably will be expunged at some point in the future), but as the rendezvous RPW grows more this may become more difficult to reason about.

Changing the rendezvous RPW to respect enabled=false has a different downside: for now, we typically leave deployed systems with a disabled current target blueprint, but briefly enable blueprint execution during upgrades to step the system forward. If we change the rendezvous RPW to respect enabled=false, then the upgrade procedures that currently instruct an operator to confirm that execution is complete will also need to instruct them to confirm that updates to the rendezvous table are complete. (#7392 would help with this, but would be a nice improvement regardless of this issue.)

An aside that is maybe more important than the issue at hand: @smklein pointed out that setting a current target blueprint with enabled=false is confusing even without the above discussion. The current target blueprint is supposed to reflect the intended system state; what does it mean if that blueprint was never enabled (and thus blueprint_executor never even attempted to make the actual system state match the intended system state)?

In discussion we talked about (at least, I may be forgetting some) four options:

Keep the existing behavior. This seems fine in practice for now, but is confusing and might become less fine (or harder to reason about) as blueprint_rendezvous grows over time.
Change the RPW to respect enabled=false. This is easier to reason about but introduces the issue described above during upgrades (i.e., waiting for both execution and rendezvous work to complete before disabling the blueprint).
Fold the work of blueprint_rendezvous into blueprint_executor, so there's only one task. This seems dicey to me because it means we now need multiple passes of blueprint_executor to be done: a first pass that sends updating configs to sleds, and then a second pass later once any changes are reflected in inventory that updates rendezvous tables. I'm pretty strongly on team "these should be two separate tasks" (but this is certainly still up for discussion!).
Change the enabled flag: instead of being attached to the current target blueprint, attach it to specific RPWs. Then we can always say "the current target blueprint is the intended state of the system", but for debugging / support / recovery, we can disable blueprint_executor and/or blueprint_rendezvous via targeted flags. This also addresses the serious "aside" noted above, since we remove the concept of a disabled current target blueprint entirely.

My preference at the moment is option 4. If that's the way we go, I think the urgency of this is kinda low, because option 1 is equivalent to option 4 if we reinterpret the "current target blueprint enabled" field as "the blueprint_executor enabled field", but we should clean that up because it's misleading and confusing.

We also had a lot of discussion around terminology and what it means for a blueprint to be "executed" or "realized". A proposal for where we might land:

"execute" and "execution" refer specifically to blueprint_executor; "execution is done" would mean "the blueprint_execution RPW completed"
"realize" and "realized" refer to all system state related to blueprints - both blueprint_executor and blueprint_rendezvous; "the blueprint has been realized" would mean that blueprint_executor successfully executed it and that blueprint_rendezvous successfully updated any relevant rendezvous tables

Using these definitions:

During on-site updates, we want to wait for blueprints to be realized, not just executed.
If the enabled=false bit only applies to blueprint_executor, then a blueprint starts being realized as soon as it is the target (via the blueprint_rendezvous RPW potentially acting on it). However, as it is not executed, it will not finish being realized.

The text was updated successfully, but these errors were encountered:

davepacheco · 2025-01-24T00:16:44Z

My preference at the moment is option 4. If that's the way we go, I think the urgency of this is kinda low, because option 1 is equivalent to option 4 if we reinterpret the "current target blueprint enabled" field as "the blueprint_executor enabled field", but we should clean that up because it's misleading and confusing.

I think that is how I've interpreted that boolean. I'm indifferent between (1) and (4) because they seem equivalent to me. I don't find where the existing boolean is to be very confusing (but I don't object to changing it either, aside from questions of priority). I think the current approach has the small benefit that the bp_target table records the history of turning the boolean on and off (and which blueprints it was turned on/off for). This isn't a big deal though because there are other, probably better ways we could track this information (like keeping status reports from the executor task).

I'd add that just because the current target blueprint is enabled doesn't mean it has been executed and -- less obviously -- just because it's disabled doesn't mean it hasn't been executed. It'd be wrong for the rendezvous RPW to not consider a debug dataset gone just because the blueprint is currently disabled -- it may have been enabled previously and the dataset may already be expunged or in the process of being expunged. (This case doesn't matter once inventory is updated to reflect the expungement, but in the meantime the RPW would be making the wrong decision.) So I think it's wrong for the RPW (or anything else) to try to infer anything about whether execution has or hasn't happened based on this flag. I feel like we want to say:

The target blueprint is always the intended state of the system (in this case, it describes the debug datasets that should exist), regardless of whether it's enabled.
The rendezvous RPW is always looking at the intersection of (what the target blueprint says should exist) and (what inventory says does exist). This has the desired behavior in both directions: when resources are added to a blueprint, they don't show up until they're also available in reality; and when resources are retired, they're retired as soon as the system says they should be retired (the new target blueprint is set), regardless of when they actually get taken offline.
Anything that wants to wait for some execution step to complete should probably wait for whatever specific condition it's looking for (e.g., some generation of zones to be available on a sled, or some specific zone to be available, or some dataset to be removed from a rendezvous table, etc.).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[reconfigurator] Should the rendezvous table RPW respect the "current target blueprint disabled" bit? #7393

[reconfigurator] Should the rendezvous table RPW respect the "current target blueprint disabled" bit? #7393

jgallagher commented Jan 23, 2025

davepacheco commented Jan 24, 2025

[reconfigurator] Should the rendezvous table RPW respect the "current target blueprint disabled" bit? #7393

[reconfigurator] Should the rendezvous table RPW respect the "current target blueprint disabled" bit? #7393

Comments

jgallagher commented Jan 23, 2025

davepacheco commented Jan 24, 2025