plugin: Replace readClusterState with existing watch events #904
Conversation
Do you mean these watch events? (autoscaling/pkg/plugin/plugin.go, lines 150 to 160 at 76ceab2)
I am trying to understand how this relates to #873. For #873, my understanding was to handle those callbacks from the pod watcher and get rid of the VM watcher entirely.
Broadly, watch events like those - but it only relies on the Pod watch events, I think. (and specifically, this PR waits for only the initial Pod "add" events to be completely processed)
Right, yeah, fair. This PR should be ok - we only care about the Pod watch events. (unless I missed something - please feel free to double-check me on that)
2b32115 → edc5942 (compare)
Have some questions
// The initial state building is complete when the counter reaches zero, at which point we close
// the channel that this function will wait on.
var initEventsCount atomic.Int32
Can you use a semaphore here instead?
- Acquire(nil, n) when events are added
- Release(1) when an event is processed
- Acquire(ctxWithTimeout, n) when waiting for events to be processed
(See the sketch after this list.)
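To make the suggestion concrete, here is a minimal sketch of what the semaphore-based version could look like, using golang.org/x/sync/semaphore. The initEvents type, the maxInitEvents bound, and the method names are assumptions for illustration, not the PR's actual code (and not necessarily what caba29f did); the key property is that acquiring the full weight only succeeds once every per-event acquisition has been released.

```go
package plugin // hypothetical placement; illustrative only

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/sync/semaphore"
)

const maxInitEvents = 1 << 20 // assumed upper bound on pending initial events

type initEvents struct {
	sem *semaphore.Weighted
}

func newInitEvents() *initEvents {
	return &initEvents{sem: semaphore.NewWeighted(maxInitEvents)}
}

// added records one pending initial event. This cannot block as long as fewer
// than maxInitEvents are outstanding and wait() has not been called yet.
func (ie *initEvents) added() {
	_ = ie.sem.Acquire(context.Background(), 1)
}

// processed records that one pending initial event has been handled.
func (ie *initEvents) processed() {
	ie.sem.Release(1)
}

// wait blocks until every added() has a matching processed(), or the timeout
// expires. Acquiring the full weight only succeeds once all units are free.
func (ie *initEvents) wait(timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	if err := ie.sem.Acquire(ctx, maxInitEvents); err != nil {
		return fmt.Errorf("timed out waiting for initial events: %w", err)
	}
	ie.sem.Release(maxInitEvents)
	return nil
}
```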
Seems simpler. Implemented it in caba29f but tbh I'm not sure if it's better. WDYT?
If incEventCount is left as a semaphore wrapper anyway, maybe instead revert to your old implementation, but wrap it all in a struct-based synchronization primitive?
The reason I suggested a semaphore is that I am wary of using raw channels and atomics in the code where it feels like there should be a synchronization primitive.
Made it a separate type in 521597e - LMK what you think.
I imagined the implementation with initEventsCount hidden inside the new helper.
Also, you don't need incEventCount now.
> I imagined the implementation with initEventsCount hidden inside the new helper.
Hm, it's hard to make that kind of abstraction safe to use (because the counter wouldn't be monotonic) -- maybe it'd be fine here, but I'd rather keep the logic inlined than make an unsafe abstraction for just this purpose.
So in its current state, incEventCount or equivalent is still needed (we can just directly add to the atomic; don't know whether that's better)
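For contrast, here is a minimal sketch of the inlined counter-plus-channel pattern being discussed (illustrative only; waitForInitEvents, register, and markProcessed are assumed names, not the PR's code). It also shows why the "not monotonic" concern matters: the counter has to start at its final total, because if increments could still arrive after decrements begin, the channel could be closed too early.

```go
package plugin // hypothetical placement; illustrative only

import (
	"fmt"
	"sync/atomic"
	"time"
)

// waitForInitEvents sketches the inlined pattern: an atomic counter of
// outstanding initial events plus a channel closed when it drains to zero.
// totalEvents must be known up front (e.g. the number of initial Pod "add"
// events); register wires markProcessed into the event handlers.
func waitForInitEvents(totalEvents int32, register func(markProcessed func()), timeout time.Duration) error {
	var initEventsCount atomic.Int32
	initEventsCount.Store(totalEvents)
	done := make(chan struct{})
	if totalEvents == 0 {
		close(done)
	}

	register(func() {
		// Safe only because the counter starts at its final total and is never
		// incremented afterwards, i.e. it decreases monotonically.
		if initEventsCount.Add(-1) == 0 {
			close(done)
		}
	})

	select {
	case <-done:
		return nil
	case <-time.After(timeout):
		return fmt.Errorf("timed out with %d initial events unprocessed", initEventsCount.Load())
	}
}
```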
521597e → b1e53de (compare)
b1e53de → 6555a77 (compare)
allowDeny: false,
// this may be a preexisting VM. If so, we should include it in "buffer" as long as it's
// supposed to be handled by us (otherwise, the "buffer" will never be resolved)
includeBuffer: pod.Spec.SchedulerName == e.state.conf.SchedulerName,
Why do we even have SchedulerName as part of the spec? Why would we ever want it to be different?
Result of discussing this: it's part of the VM spec because neonvm is a logically separate dependency of "autoscaling", so the ability to run with or without the rest of autoscaling requires the ability to set the pod's scheduler name.
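For illustration, the includeBuffer condition from the diff above amounts to something like the following standalone check (a hypothetical helper; the PR inlines the comparison): a preexisting Pod only counts toward "buffer" if its spec names our scheduler, i.e. it was meant to be scheduled by this plugin at all.

```go
package plugin // hypothetical placement; illustrative only

import corev1 "k8s.io/api/core/v1"

// podManagedByUs mirrors the includeBuffer condition from the diff above.
func podManagedByUs(pod *corev1.Pod, ourSchedulerName string) bool {
	return pod.Spec.SchedulerName == ourSchedulerName
}
```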
Are you going to do a rebase merge? If so, the tmp commit needs to be squashed somewhere, maybe into the middle commit.
After the changes from #865 that switched to fetching VmInfo from Pod objects, the VM store is no longer needed long-term, just during initial cluster state reading (which can *also* be swapped out) and for event generation. This commit changes the VM store from being stored in the AutoscaleEnforcer struct to being passed in to readClusterState as an argument.
In short, readClusterState is super complicated, separately reimplements the reserveResources() logic, and may be the source of several startup-related bugs (probably #671 and #852). So, given that we *already* have a pathway for updating our internal state from changes in the cluster (i.e. the watch events), we should just use that instead.
c8e250e → ceceb07 (compare)
After the changes from #865 that switched to fetching VmInfo from Pod objects, the VM store is no longer needed long-term, just during initial cluster state reading (which can *also* be swapped out) and for event generation. This commit changes the VM store from being stored in the AutoscaleEnforcer struct to being passed in to readClusterState as an argument.
... and a few other commits tossed in for fun.
Probably best to review commit-by-commit.
The thing I'm most worried about is that this may change buffer handling.
Might be a good idea to go for #840 first, but I suspect that might also have its own issues.
This is a precursor to #517, to minimize the number of places we need to change resource handling.