Bug: scheduler has negative "buffer" value #852
sharnoff added the t/bug (Issue Type: Bug) and c/autoscaling/scheduler (Component: autoscaling: k8s scheduler) labels on Mar 9, 2024.
Can this be solved by #840?
sharnoff added a commit referencing this issue on Apr 15, 2024, with the message:

> In short, readClusterState is super complicated, separately reimplements the reserveResources() logic, and may be the source of several startup-related bugs (probably #671 and #852). So, given that we *already* have a pathway for updating our internal state from changes in the cluster (i.e. the watch events), we should just use that instead.
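The fix described in the commit message can be sketched as follows. This is a hypothetical illustration (the `event` and `state` types are assumptions, not the project's actual code): rather than a separate `readClusterState()` that re-implements reservation logic, startup replays the current cluster contents through the same event handler used at runtime, so everything goes through one shared `reserveResources()` path.

```go
package main

import "fmt"

// event is a simplified stand-in for a cluster watch event.
type event struct {
	kind string // assumed event kinds: "Add", "Delete"
	pod  string
	cpu  int64
}

// state is a simplified stand-in for the scheduler's internal state.
type state struct {
	reservedCPU int64
}

// reserveResources is the single shared reservation path; both runtime
// watch events and startup go through it.
func (s *state) reserveResources(cpu int64) { s.reservedCPU += cpu }

func (s *state) handleEvent(ev event) {
	switch ev.kind {
	case "Add":
		s.reserveResources(ev.cpu)
	case "Delete":
		s.reserveResources(-ev.cpu)
	}
}

func main() {
	s := &state{}
	// Startup: replay the existing pods as "Add" events instead of running
	// a bespoke readClusterState() implementation.
	for _, ev := range []event{{"Add", "pod-a", 2}, {"Add", "pod-b", 3}} {
		s.handleEvent(ev)
	}
	fmt.Println(s.reservedCPU) // 5
}
```

Because startup and steady-state share one code path, the reservation logic cannot silently diverge between the two, which is the class of bug this issue describes.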
sharnoff added further commits referencing this issue, each with the same message, on Apr 16, May 10, May 21, and May 22 (twice), 2024.
Environment
Prod (occurred twice recently)
Steps to reproduce
Not yet clear. Here's an example:
I think it's entirely caused by faulty logic in `(*AutoscaleEnforcer).readClusterState()`, but I haven't looked into it thoroughly. And tbh, it's a little weird that `readClusterState` has its own implementation of the reserve logic, rather than using the shared version that was added in #666.

Expected result
Any buffer value from adding a VM should be non-negative.
Actual result
The memory "buffer" value was negative (see: `-1Gi buffer`), and the value for CPU underflowed.

Other logs, links
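The underflowed CPU value is consistent with subtracting a reservation from an unsigned counter. A minimal Go sketch (hypothetical code, not the scheduler's actual implementation) of how that wraps, alongside a guarded version that clamps at zero:

```go
package main

import "fmt"

// subtractBuffer illustrates the failure mode: with an unsigned type,
// subtracting a reservation larger than the remaining buffer wraps around
// to a huge value instead of going negative.
func subtractBuffer(buffer, reserved uint64) uint64 {
	return buffer - reserved // underflows when reserved > buffer
}

// subtractBufferClamped clamps the result at zero instead of wrapping.
func subtractBufferClamped(buffer, reserved uint64) uint64 {
	if reserved > buffer {
		return 0
	}
	return buffer - reserved
}

func main() {
	fmt.Println(subtractBuffer(2, 3))        // 18446744073709551615 (wrapped)
	fmt.Println(subtractBufferClamped(2, 3)) // 0
}
```

Go's unsigned integer arithmetic wraps silently rather than panicking, so a missing bounds check like this produces exactly the kind of huge/underflowed values reported here, while a signed representation would instead show up as a negative buffer (as with the `-1Gi` memory value).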