You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The current target blueprint has a boolean indicating whether or not it's enabled. The blueprint_executor task respects this bit by returning before doing any work if enabled is false. The blueprint_rendezvous task currently does not inspect this bit at all. It's not immediately obvious what the rendezvous task should do here. The below is an attempt to summarize a discussion we had about this.
The rendezvous task currently has two duties:
If a debug dataset is in-service in the current target blueprint and is present in the latest inventory collection, ensure that a row for it exists in the debug_dataset rendezvous table.
If a debug dataset is expunged in the current target blueprint, ensure that if a row for that dataset exists in the debug_dataset table, that row is tombstoned.
Failing to respect enabled=false presumably doesn't affect the first duty. Since it's guarded by "is present in inventory", a dataset can only be added if the blueprint was enabled at some point in the past (otherwise, how could the dataset exist to be present in inventory), so the rendezvous RPW could have seen it and written it at some point when the blueprint (or one of its ancestors) was enabled.
Failing to respect enabled=falsedoes affect the second duty. If a new target blueprint is set with enabled=false which expunges a debug dataset, the rendezvous RPW will tombstone the row, preventing downstream consumers (e.g., support bundles) from using that dataset, even though the blueprint_executor will do nothing, presumably leaving the debug dataset in service until the blueprint is enabled. This is inconsistent and confusing; it means the current target blueprint can be partially acted upon regardless of its enabled status. In the specific case of debug datasets, the effect seems pretty minimal (we'll stop choosing a dataset that presumably will be expunged at some point in the future), but as the rendezvous RPW grows more this may become more difficult to reason about.
Changing the rendezvous RPW to respect enabled=false has a different downside: for now, we typically leave deployed systems with a disabled current target blueprint, but briefly enable blueprint execution during upgrades to step the system forward. If we change the rendezvous RPW to respect enabled=false, then the upgrade procedures that currently instruct an operator to confirm that execution is complete will also need to instruct them to confirm that updates to the rendezvous table are complete. (#7392 would help with this, but would be a nice improvement regardless of this issue.)
An aside that is maybe more important than the issue at hand: @smklein pointed out that setting a current target blueprint with enabled=false is confusing even without the above discussion. The current target blueprint is supposed to reflect the intended system state; what does it mean if that blueprint was never enabled (and thus blueprint_executor never even attempted to make the actual system state match the intended system state)?
In discussion we talked about (at least, I may be forgetting some) four options:
Keep the existing behavior. This seems fine in practice for now, but is confusing and might become less fine (or harder to reason about) as blueprint_rendezvous grows over time.
Change the RPW to respect enabled=false. This is easier to reason about but introduces the issue described above during upgrades (i.e., waiting for both execution and rendezvous work to complete before disabling the blueprint).
Fold the work of blueprint_rendezvous into blueprint_executor, so there's only one task. This seems dicey to me because it means we now need multiple passes of blueprint_executor to be done: a first pass that sends updating configs to sleds, and then a second pass later once any changes are reflected in inventory that updates rendezvous tables. I'm pretty strongly on team "these should be two separate tasks" (but this is certainly still up for discussion!).
Change the enabled flag: instead of being attached to the current target blueprint, attach it to specific RPWs. Then we can always say "the current target blueprint is the intended state of the system", but for debugging / support / recovery, we can disable blueprint_executor and/or blueprint_rendezvous via targeted flags. This also addresses the serious "aside" noted above, since we remove the concept of a disabled current target blueprint entirely.
My preference at the moment is option 4. If that's the way we go, I think the urgency of this is kinda low, because option 1 is equivalent to option 4 if we reinterpret the "current target blueprint enabled" field as "the blueprint_executorenabled field", but we should clean that up because it's misleading and confusing.
We also had a lot of discussion around terminology and what it means for a blueprint to be "executed" or "realized". A proposal for where we might land:
"execute" and "execution" refer specifically to blueprint_executor; "execution is done" would mean "the blueprint_execution RPW completed"
"realize" and "realized" refer to all system state related to blueprints - both blueprint_executor and blueprint_rendezvous; "the blueprint has been realized" would mean that blueprint_executor successfully executed it and that blueprint_rendezvous successfully updated any relevant rendezvous tables
Using these definitions:
During on-site updates, we want to wait for blueprints to be realized, not just executed.
If the enabled=false bit only applies to blueprint_executor, then a blueprint starts being realized as soon as it is the target (via the blueprint_rendezvous RPW potentially acting on it). However, as it is not executed, it will not finish being realized.
The text was updated successfully, but these errors were encountered:
My preference at the moment is option 4. If that's the way we go, I think the urgency of this is kinda low, because option 1 is equivalent to option 4 if we reinterpret the "current target blueprint enabled" field as "the blueprint_executor enabled field", but we should clean that up because it's misleading and confusing.
I think that is how I've interpreted that boolean. I'm indifferent between (1) and (4) because they seem equivalent to me. I don't find where the existing boolean is to be very confusing (but I don't object to changing it either, aside from questions of priority). I think the current approach has the small benefit that the bp_target table records the history of turning the boolean on and off (and which blueprints it was turned on/off for). This isn't a big deal though because there are other, probably better ways we could track this information (like keeping status reports from the executor task).
I'd add that just because the current target blueprint is enabled doesn't mean it has been executed and -- less obviously -- just because it's disabled doesn't mean it hasn't been executed. It'd be wrong for the rendezvous RPW to not consider a debug dataset gone just because the blueprint is currently disabled -- it may have been enabled previously and the dataset may already be expunged or in the process of being expunged. (This case doesn't matter once inventory is updated to reflect the expungement, but in the meantime the RPW would be making the wrong decision.) So I think it's wrong for the RPW (or anything else) to try to infer anything about whether execution has or hasn't happened based on this flag. I feel like we want to say:
The target blueprint is always the intended state of the system (in this case, it describes the debug datasets that should exist), regardless of whether it's enabled.
The rendezvous RPW is always looking at the intersection of (what the target blueprint says should exist) and (what inventory says does exist). This has the desired behavior in both directions: when resources are added to a blueprint, they don't show up until they're also available in reality; and when resources are retired, they're retired as soon as the system says they should be retired (the new target blueprint is set), regardless of when they actually get taken offline.
Anything that wants to wait for some execution step to complete should probably wait for whatever specific condition it's looking for (e.g., some generation of zones to be available on a sled, or some specific zone to be available, or some dataset to be removed from a rendezvous table, etc.).
The current target blueprint has a boolean indicating whether or not it's
enabled
. Theblueprint_executor
task respects this bit by returning before doing any work ifenabled
is false. Theblueprint_rendezvous
task currently does not inspect this bit at all. It's not immediately obvious what the rendezvous task should do here. The below is an attempt to summarize a discussion we had about this.The rendezvous task currently has two duties:
debug_dataset
rendezvous table.debug_dataset
table, that row is tombstoned.Failing to respect
enabled=false
presumably doesn't affect the first duty. Since it's guarded by "is present in inventory", a dataset can only be added if the blueprint was enabled at some point in the past (otherwise, how could the dataset exist to be present in inventory), so the rendezvous RPW could have seen it and written it at some point when the blueprint (or one of its ancestors) wasenabled
.Failing to respect
enabled=false
does affect the second duty. If a new target blueprint is set withenabled=false
which expunges a debug dataset, the rendezvous RPW will tombstone the row, preventing downstream consumers (e.g., support bundles) from using that dataset, even though theblueprint_executor
will do nothing, presumably leaving the debug dataset in service until the blueprint is enabled. This is inconsistent and confusing; it means the current target blueprint can be partially acted upon regardless of itsenabled
status. In the specific case of debug datasets, the effect seems pretty minimal (we'll stop choosing a dataset that presumably will be expunged at some point in the future), but as the rendezvous RPW grows more this may become more difficult to reason about.Changing the rendezvous RPW to respect
enabled=false
has a different downside: for now, we typically leave deployed systems with a disabled current target blueprint, but briefly enable blueprint execution during upgrades to step the system forward. If we change the rendezvous RPW to respectenabled=false
, then the upgrade procedures that currently instruct an operator to confirm that execution is complete will also need to instruct them to confirm that updates to the rendezvous table are complete. (#7392 would help with this, but would be a nice improvement regardless of this issue.)An aside that is maybe more important than the issue at hand: @smklein pointed out that setting a current target blueprint with
enabled=false
is confusing even without the above discussion. The current target blueprint is supposed to reflect the intended system state; what does it mean if that blueprint was never enabled (and thusblueprint_executor
never even attempted to make the actual system state match the intended system state)?In discussion we talked about (at least, I may be forgetting some) four options:
blueprint_rendezvous
grows over time.enabled=false
. This is easier to reason about but introduces the issue described above during upgrades (i.e., waiting for both execution and rendezvous work to complete before disabling the blueprint).blueprint_rendezvous
intoblueprint_executor
, so there's only one task. This seems dicey to me because it means we now need multiple passes ofblueprint_executor
to be done: a first pass that sends updating configs to sleds, and then a second pass later once any changes are reflected in inventory that updates rendezvous tables. I'm pretty strongly on team "these should be two separate tasks" (but this is certainly still up for discussion!).enabled
flag: instead of being attached to the current target blueprint, attach it to specific RPWs. Then we can always say "the current target blueprint is the intended state of the system", but for debugging / support / recovery, we can disableblueprint_executor
and/orblueprint_rendezvous
via targeted flags. This also addresses the serious "aside" noted above, since we remove the concept of a disabled current target blueprint entirely.My preference at the moment is option 4. If that's the way we go, I think the urgency of this is kinda low, because option 1 is equivalent to option 4 if we reinterpret the "current target blueprint
enabled
" field as "theblueprint_executor
enabled
field", but we should clean that up because it's misleading and confusing.We also had a lot of discussion around terminology and what it means for a blueprint to be "executed" or "realized". A proposal for where we might land:
blueprint_executor
; "execution is done" would mean "theblueprint_execution
RPW completed"blueprint_executor
andblueprint_rendezvous
; "the blueprint has been realized" would mean thatblueprint_executor
successfully executed it and thatblueprint_rendezvous
successfully updated any relevant rendezvous tablesUsing these definitions:
enabled=false
bit only applies toblueprint_executor
, then a blueprint starts being realized as soon as it is the target (via theblueprint_rendezvous
RPW potentially acting on it). However, as it is not executed, it will not finish being realized.The text was updated successfully, but these errors were encountered: