Attempting to create a VM with more than 32 vcpus brings nexus down #3212
Comments
And here is what is in the logs on the sled that handled this request:
Through some shenanigans, I managed to capture the Propolis error:
There's currently a limit of 32 vCPUs per instance, and Propolis returns a 500 error if asked for more.
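For illustration, here is a minimal sketch of the kind of cap check involved. All names here are hypothetical, not Propolis's actual API; the real limit comes from illumos bhyve, and part of the problem is that a violation currently surfaces as a 500 rather than a client error:

```rust
// Hypothetical sketch of a vCPU cap check; names are illustrative.
const MAX_VCPUS: u8 = 32;

fn validate_vcpu_count(requested: u8) -> Result<(), String> {
    if requested > MAX_VCPUS {
        // Ideally this would become a 4xx client error at the API
        // boundary instead of a 500 from Propolis.
        return Err(format!(
            "requested {} vCPUs, but the maximum is {}",
            requested, MAX_VCPUS
        ));
    }
    Ok(())
}
```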
There are two issues here that might need splitting up:
The crash seems likely to be oxidecomputer/steno#26. Re: propagating the error message, I think there are a few other issues here:
Propolis is the source of truth on the current vCPU cap, which actually comes from illumos bhyve. This cap may be changed or relaxed in the future (there is work in upstream FreeBSD bhyve around this). It feels cleaner to me to have this checked in the one place that knows the limit.
That makes sense, but it seems like more work would be needed to raise that cap. If the cap varies across sleds, wouldn't we want Nexus to take this into account when choosing which sled to use for a provision? Just thinking out loud: Propolis could remain the source of truth, and we could propagate the cap out of Propolis and into CockroachDB. Nexus could then take it into account when selecting a sled for a new provision. If no sled could possibly satisfy the request, we could fail it without even creating the saga. I think this would be a better user experience for the case where someone just inputs a number higher than we support.
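A rough sketch of what that placement filter might look like; every name here is hypothetical, and this is just the shape of the idea, not Nexus's actual provisioning code:

```rust
// Hypothetical sketch: each sled advertises a per-instance vCPU cap
// (ultimately sourced from Propolis/bhyve and recorded in CockroachDB),
// and Nexus filters on it during placement.
struct SledResources {
    sled_name: String,
    max_vcpus_per_instance: u16,
}

/// Choose a sled that can satisfy the requested vCPU count, or fail the
/// request up front, before any saga is created.
fn choose_sled(
    sleds: &[SledResources],
    requested_vcpus: u16,
) -> Result<&SledResources, String> {
    sleds
        .iter()
        .find(|s| s.max_vcpus_per_instance >= requested_vcpus)
        .ok_or_else(|| format!("no sled supports {} vCPUs", requested_vcpus))
}
```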
I'd probably just encode the limit into sled-agent for now, rather than Propolis, so the cap can be communicated to Nexus without requiring a Propolis instance/zone to exist first. Once we get around to lifting that arbitrary 32-vCPU limit in bhyve, we can make sled-agent aware of its dynamic nature, and the rest would fall out (assuming logic for handling differing limits is built into the control plane at that point).
Will apply validations at the API level per FCS: max 32 vCPUs and 64 GBytes of DRAM.
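A minimal sketch of an API-level check enforcing those limits. The names are hypothetical rather than Nexus's actual code, and treating "64 GBytes" as binary gigabytes (GiB) is an assumption; the real mitigation is in #3574 below:

```rust
// Sketch of API-level instance-size validation per the FCS limits above.
const MAX_VCPUS: u16 = 32;
const MAX_MEMORY_BYTES: u64 = 64 * (1 << 30); // 64 GiB (assumed binary units)

fn validate_instance_create(ncpus: u16, memory_bytes: u64) -> Result<(), String> {
    if ncpus > MAX_VCPUS {
        return Err(format!("a maximum of {} vCPUs is supported", MAX_VCPUS));
    }
    if memory_bytes > MAX_MEMORY_BYTES {
        return Err(format!(
            "memory must be at most {} bytes (64 GiB)",
            MAX_MEMORY_BYTES
        ));
    }
    Ok(())
}
```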
The mitigation for this landed in #3574.
@zephraph brought up the need for a disk size limit. Here is what I propose:
The size limit won't be an FCS blocker (we can always raise it if a customer needs a higher limit). Update from @leftwo: I just tried on a bench gimlet, and 1 TiB is the largest disk size I can create.
I've moved this to "unscheduled"; we can revisit whether sled-agent should own the validation.
@zephraph - With oxidecomputer/propolis#474 landed (and VMM reservoir #3223 just before FCS), can you please raise the VM instance size limit to the following:
Also, I was off by 1 GiB on the max disk size; it should have been 1023 GiB, not 1 TiB. Would you please change that as well? Finally, on second thought, the API may actually be the right place for setting all the limits, since it's where the documentation lives. If we do the checks in sled-agent, the checks and the API docs will become disjoint. As such, you can mark this ticket closed once you are done with the size limit changes.
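For the disk side, the corrected cap would look something like this; again a sketch with illustrative names, not the actual omicron identifiers:

```rust
// Sketch of the corrected disk-size cap: 1023 GiB rather than 1 TiB.
const MAX_DISK_SIZE_BYTES: u64 = 1023 * (1 << 30); // 1023 GiB

fn validate_disk_size(size_bytes: u64) -> Result<(), String> {
    if size_bytes > MAX_DISK_SIZE_BYTES {
        return Err(format!(
            "disk size must be at most 1023 GiB ({} bytes)",
            MAX_DISK_SIZE_BYTES
        ));
    }
    Ok(())
}
```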
Yes, absolutely, I'll get on that. |
I tried to create an image from the CLI with:
which reported a timeout fairly quickly (around 5 seconds):
and Nexus crashed.
I'll put the relevant log on shared storage somewhere, but here are the final few events that I extracted: