
Conversation


@msu8 msu8 commented Jul 28, 2025

Adds hibernation functionality for OpenStack clusters.

Hibernation is based on restoring instances from snapshots, freeing up all of the project's resources that were taken by the instances. Replicating true hibernation in OpenStack is difficult to impossible for us - e.g. shelving exists in OpenStack, but whether it releases the project's resources depends on how OpenStack is configured. This snapshot-based approach is universally compatible.

After a ClusterPool is created and claimed:

$ oc patch cd $cd -n $cd_ns --type='merge' -p $'spec:\n powerState: Hibernating'

Hibernation (sketched below):

  1. Find all cluster instances by infraID prefix
  2. Create snapshots of each instance
  3. Collect and save all instance attributes to a secret (CD ns):
    • Flavor IDs, port IDs, security groups
    • Instance metadata, networking config
    • Snapshot IDs for restoration
  4. Delete the actual instances (this frees up project's resources)
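
To make the flow concrete, here is a rough Go sketch of those steps against gophercloud's compute API - illustration only: the real actuator persists state to the secret, polls snapshot image status, and handles partial failures, all of which are omitted or stubbed here.

```go
// Sketch of the hibernation path; heavily simplified.
package hibernation

import (
	"fmt"
	"time"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/compute/v2/servers"
)

func hibernate(compute *gophercloud.ServiceClient, infraID string) error {
	// 1. Find all cluster instances by infraID prefix (Nova's name filter is a regex).
	pages, err := servers.List(compute, servers.ListOpts{Name: "^" + infraID}).AllPages()
	if err != nil {
		return err
	}
	instances, err := servers.ExtractServers(pages)
	if err != nil {
		return err
	}

	// 2. Snapshot every instance, sharing one timestamp per hibernation event.
	stamp := time.Now().UTC().Format("20060102-150405")
	snapshotIDs := map[string]string{}
	for _, s := range instances {
		id, err := servers.CreateImage(compute, s.ID, servers.CreateImageOpts{
			Name: fmt.Sprintf("%s-hibernation-%s", s.Name, stamp),
		}).ExtractImageID()
		if err != nil {
			return err
		}
		snapshotIDs[s.ID] = id
	}

	// 3. Save flavors, ports, security groups, metadata and snapshotIDs to the
	//    hibernation secret in the CD namespace (omitted here).

	// 4. Once every snapshot image has gone ACTIVE (polling omitted), delete the
	//    instances to release the project's quota.
	for _, s := range instances {
		if err := servers.Delete(compute, s.ID).ExtractErr(); err != nil {
			return err
		}
	}
	return nil
}
```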

$ oc patch cd $cd -n $cd_ns --type='merge' -p $'spec:\n powerState: Running'

Restoration (sketched below):

  1. Load saved config from secret
  2. Validate project has enough resources
  3. Recreate instances from snapshots
  4. Wait for instances to become ACTIVE
  5. Clean up hibernation secret
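
And the corresponding resume path, again as a sketch - the instanceConfig type below is a hypothetical stand-in for whatever the hibernation secret actually stores per instance.

```go
package hibernation

import (
	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/compute/v2/servers"
)

// instanceConfig is a placeholder for the per-instance data loaded from the secret.
type instanceConfig struct {
	Name, SnapshotID, FlavorRef, NetworkID, PortID string
	SecurityGroups                                 []string
	Metadata                                       map[string]string
}

func resume(compute *gophercloud.ServiceClient, configs []instanceConfig) error {
	for _, c := range configs {
		// Recreate the instance from its snapshot with the saved attributes.
		srv, err := servers.Create(compute, servers.CreateOpts{
			Name:           c.Name,
			ImageRef:       c.SnapshotID,
			FlavorRef:      c.FlavorRef,
			Networks:       []servers.Network{{UUID: c.NetworkID, Port: c.PortID}},
			SecurityGroups: c.SecurityGroups,
			Metadata:       c.Metadata,
		}).Extract()
		if err != nil {
			return err
		}
		// Block (here up to 10 minutes) until the instance reports ACTIVE.
		if err := servers.WaitForStatus(compute, srv.ID, "ACTIVE", 600); err != nil {
			return err
		}
	}
	// The snapshot images and the hibernation secret are cleaned up afterwards (omitted).
	return nil
}
```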

This approach still has many quirks, especially around the time it takes to save the snapshots and restore the instances, with the instances passing through different states along the way - the code is set up to work around those by frequently updating and checking the actual state of the instances.

Looking forward to feedback and more suggestions.

msu8 added 4 commits July 12, 2025 14:15
- openstack_actuator.go - hibernation based on snapshots/restoring
- openstack/client.go - minimal OpenstackClient to support "hibernation" operations
@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jul 28, 2025
Contributor

openshift-ci bot commented Jul 28, 2025

Hi @msu8. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci bot requested review from 2uasimojo and suhanime July 28, 2025 19:23
Contributor

openshift-ci bot commented Jul 28, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: msu8
Once this PR has been reviewed and has the lgtm label, please assign suhanime for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@2uasimojo
Member

/ok-to-test

I'll start having a look at this now.

About how long does each side of the operation take?

@openshift-ci openshift-ci bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 29, 2025
@2uasimojo
Member

2uasimojo commented Jul 29, 2025

I'd like to have a discussion about the different options available here. Specifically the relative resource savings provided by shelving vs. something like suspend, traded off with the time to perform the hibernate and resume operations.

It has been a minute, so I asked grok to refresh my memory. IIUC shelving:

  • depends on a glance setup (is that pretty much universal with all OpenStack deployments?)
  • frees up all the infra resources (CPU, memory, network, storage) at the expense of big blobs in glance
  • takes a long time because it's pushing/pulling instance disk images across the network

In contrast, suspend:

  • saves RAM state to the compute node for each instance (i.e. a lot less data, and local vs over the network)
  • frees CPU but retains memory and local storage reservations
  • still stops I/O

The architectural raison d'être of hibernation in a cloud scenario is to reduce infra cost while making it faster to resume than to provision. In AWS, the former is ~2/3 savings; and the latter is about an order of magnitude (~4-5m vs 45m). In particular, although it may be a side benefit in some setups, overcommitting is not a primary goal. That is, we're not trying to provide a way to create more clusters than you have resources for, on the theory that you would only ever be running a subset of them. The controls we provide on ClusterPools -- size/runningCount, maxSize and maxConcurrent -- are more designed for quota and throttling than for overcommitting. (It feels like you could plan for saturation based on maxSize - (size - runningCount), but that doesn't account for in-flight provisions that happen when a cluster is claimed from the pool -- those clusters are only hibernated once they have finished provisioning.) And importantly, hibernation in AWS entails stopping instances, but we're not deleting them. We're also not touching other infra pieces, which would still contribute to overall quota usage.

Putting this in context with shelve vs suspend: shelving results in a 100% resource savings (glance aside) at the expense of slow hibernate/resume, with no real guarantee of overcommit capability. Whereas suspend is much quicker (right?) at the cost of effectively enforcing no overcommit.

In conclusion, I feel like suspend aligns more closely with the philosophy for which hibernation was conceived. What are your thoughts? I would like to get @abraverm's input as well.

(BTW, it would have been neat to have this discussion, e.g. via an enhancement proposal, before you sunk all this work into the implementation...)

@2uasimojo
Member

2uasimojo commented Jul 29, 2025

Re build errors: it looks like you need to do go mod vendor; make update vendor. (We really should have it so the latter incorporates the former, but it's harder than it sounds to get everything done in the right order :( )

You'll have to explicitly check in the new vendored files from gophercloud.

Member

@2uasimojo 2uasimojo left a comment


Partial review. I only made it through the client.

I think perhaps we should discuss the design (see above) before spending more time on the code review process.

Also, please assign HIVE-2042 to yourself and reference it in this PR's title so it gets cross-linked.

Comment on lines 362 to 367

```bash
$ oc patch cd $cd -n $cd_ns --type='merge' -p $'spec:\n powerState: Hibernating'
$ oc patch cd $cd -n $cd_ns --type='merge' -p $'spec:\n powerState: Running'
```
Member

I think this chunk isn't needed.

  • Hibernation is used automatically for unclaimed clusters in ClusterPools whenever size > runningCount.
  • Hibernation isn't specific to ClusterPools -- it can be used for standalone ClusterDeployments.
  • These patch commands are already listed above (~L35).

GetFlavorDetails(ctx context.Context, flavorID string) (*FlavorDetails, error)
}

type Network struct {
Member

Can't we import these typedefs from gophercloud rather than shimming them?

Comment on lines 195 to 210
networks := []servers.Network{
{
UUID: opts.NetworkID,
Port: opts.PortID,
},
}

// Convert custom options to satisfy GopherCloud
createOpts := &servers.CreateOpts{
Name: opts.Name,
ImageRef: opts.ImageRef,
FlavorRef: opts.FlavorRef,
Networks: networks,
SecurityGroups: opts.SecurityGroups,
Metadata: opts.Metadata,
}
Member

...like here. What's the advantage of having our own ServerCreateOpts that we simply map to servers.CreateOpts in code, vs letting the caller construct the servers.CreateOpts out of effectively the same data?
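
To illustrate the suggestion (the CreateServer signature and the OpenStackInstanceConfig field names here are assumed, not copied from the PR): the actuator could build the upstream options itself and the client would just forward them.

```go
// Hypothetical: the Client interface takes servers.CreateOpts directly,
// so no local ServerCreateOpts shim is needed.
func (a *openstackActuator) recreateServer(ctx context.Context, c openstackclient.Client, cfg OpenStackInstanceConfig) (*servers.Server, error) {
	return c.CreateServer(ctx, servers.CreateOpts{
		Name:           cfg.Name,
		ImageRef:       cfg.SnapshotID,
		FlavorRef:      cfg.Flavor,
		Networks:       []servers.Network{{UUID: cfg.NetworkID, Port: cfg.PortID}},
		SecurityGroups: cfg.SecurityGroups,
		Metadata:       cfg.Metadata,
	})
}
```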

Comment on lines 164 to 165
server, err := servers.Get(c.computeClient, serverID).Extract()
return server, err
Member

nit, could just be

Suggested change
server, err := servers.Get(c.computeClient, serverID).Extract()
return server, err
return servers.Get(c.computeClient, serverID).Extract()

Comment on lines 222 to 223
image, err := images.Get(c.imageClient, imageID).Extract()
return image, err
Member

nit:

Suggested change
image, err := images.Get(c.imageClient, imageID).Extract()
return image, err
return images.Get(c.imageClient, imageID).Extract()

return nil, err
}

portList, err := ports.ExtractPorts(allPages)
Member

Another example: why would we need to repackage the resulting []ports.Port into our own []Port rather than just

Suggested change
portList, err := ports.ExtractPorts(allPages)
return ports.ExtractPorts(allPages)

?

}

// Get the security group names for a specific server
func (c *openstackClient) GetServerSecurityGroups(ctx context.Context, serverID string) ([]string, error) {
Member

If it is truly desirable for this func to return a list of names rather than a list of security groups, suggest naming it

Suggested change
func (c *openstackClient) GetServerSecurityGroups(ctx context.Context, serverID string) ([]string, error) {
func (c *openstackClient) GetServerSecurityGroupNames(ctx context.Context, serverID string) ([]string, error) {

instead.

Comment on lines 436 to 456
if creds.TenantID != "" {
authOpts.TenantID = creds.TenantID
}
if creds.TenantName != "" {
authOpts.TenantName = creds.TenantName
}
if creds.DomainID != "" {
authOpts.DomainID = creds.DomainID
}
if creds.DomainName != "" {
authOpts.DomainName = creds.DomainName
}
if creds.UserDomainID != "" {
authOpts.DomainID = creds.UserDomainID
}
if creds.ProjectDomainID != "" {
authOpts.DomainID = creds.ProjectDomainID
}
if creds.ProjectDomainName != "" {
authOpts.DomainName = creds.ProjectDomainName
}
Member

I think all of these conditionals are redundant. If the creds.* field is the empty string, you're just avoiding setting the corresponding authOpts.* field to the empty string... which is already its default (zero) value.

...but again, this is only relevant if we continue to see the need for a shim struct rather than just letting the caller use the upstream gophercloud.AuthOptions directly.
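
A sketch of the simplification being suggested (the field mapping is illustrative): plain assignment produces the same result, because copying an empty string over a field that already holds its zero value changes nothing.

```go
// No conditionals needed - empty creds fields simply leave the corresponding
// authOpts fields at their zero value.
authOpts := gophercloud.AuthOptions{
	IdentityEndpoint: creds.AuthURL,
	Username:         creds.Username,
	Password:         creds.Password,
	TenantID:         creds.TenantID,
	TenantName:       creds.TenantName,
	DomainID:         creds.DomainID,
	DomainName:       creds.DomainName,
}
```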


// Create a client wrapper object for interacting with OpenStack.
func NewClientFromSecret(secret *corev1.Secret) (Client, error) {
cloudsYaml, ok := secret.Data["clouds.yaml"]
Member

nit: we have a const for this.

@abraverm
Contributor

I'd like to have a discussion about the different options available here. Specifically the relative resource savings provided by shelving vs. something like suspend, traded off with the time to perform the hibernate and resume operations.

It has been a minute, so I asked grok to refresh my memory. IIUC shelving:

  • depends on a glance setup (is that pretty much universal with all OpenStack deployments?)
  • frees up all the infra resources (CPU, memory, network, storage) at the expense of big blobs in glance
  • takes a long time because it's pushing/pulling instance disk images across the network

In contrast, suspend:

  • saves RAM state to the compute node for each instance (i.e. a lot less data, and local vs over the network)
  • frees CPU but retains memory and local storage reservations
  • still stops I/O

The architectural raison d'être of hibernation in a cloud scenario is to reduce infra cost while making it faster to resume than to provision. In AWS, the former is ~2/3 savings; and the latter is about an order of magnitude (~4-5m vs 45m). In particular, although it may be a side benefit in some setups, overcommitting is not a primary goal. That is, we're not trying to provide a way to create more clusters than you have resources for, on the theory that you would only ever be running a subset of them. The controls we provide on ClusterPools -- size/runningCount, maxSize and maxConcurrent -- are more designed for quota and throttling than for overcommitting. (It feels like you could plan for saturation based on maxSize - (size - runningCount), but that doesn't account for in-flight provisions that happen when a cluster is claimed from the pool -- those clusters are only hibernated once they have finished provisioning.) And importantly, hibernation in AWS entails stopping instances, but we're not deleting them. We're also not touching other infra pieces, which would still contribute to overall quota usage.

Putting this in context with shelve vs suspend: shelving results in a 100% resource savings (glance aside) at the expense of slow hibernate/resume, with no real guarantee of overcommit capability. Whereas suspend is much quicker (right?) at the cost of effectively enforcing no overcommit.

In conclusion, I feel like suspend aligns more closely with the philosophy for which hibernation was conceived. What are your thoughts? I would like to get @abraverm's input as well.

The main purpose of hibernation in on-premise clouds is to increase the number of clusters, since resources are limited and disk is cheaper than memory and CPU. That is, having more clusters in hibernation, ready to be started, saves the time it takes to deploy a new cluster (about ~10 min vs. 1 hour). In OpenStack, each project (tenant) usually has a resource quota. Suspending an instance (VM) doesn't reduce the resource usage. Shelving could reduce the resource usage, but it requires additional configuration - quota usage from Placement - which in our case caused problems and we had to revert. This is why we ended up with this design.

(BTW, it would have been neat to have this discussion, e.g. via an enhancement proposal, before you sunk all this work into the implementation...)

Sure thing.

@2uasimojo
Member

(about ~10 min vs 1 hour).

If this is a good heuristic for the timing of the shelving/unshelving process, then I'm fine with it. I was just envisioning trying to shove a couple TB of disk images across a network to a storage device and thinking that might take long enough to kill the value vs spinning up a new cluster. Presumably the speed is dependent on having glance connected to particular types of storage -- more direct-attached-y and less network-y? If so, that seems like something worth calling out in the doc section you're adding here.

The main purpose of hibernation in on premise clouds is to increase number of clusters

I'll defer to your experience on this point, but I'll reiterate that hive seems to have been designed (before my time) such that overcommit capability was not a primary goal. Let me elaborate on the example I mentioned above:

(It feels like you could plan for saturation based on maxSize - (size - runningCount), but that doesn't account for in-flight provisions that happen when a cluster is claimed from the pool -- those clusters are only hibernated once they have finished provisioning.)

So simplistically let's say you have 1 CPU per cluster and a quota of 5 CPU. You think you can overcommit with the following settings:

size: 7
runningCount: 2
maxSize: 10
maxConcurrent: 5
  • maxSize - size = 10 - 7 = 3 -- this is the max number of claimed clusters at once.
  • Balanced with runningCount = 2, that's your 5 total running clusters to saturate your CPU quota.
  • size - runningCount = 7 - 2 = 5 -- this is your overcommit, which is okay because they're hibernating and don't affect quota.
  • maxConcurrent = 5 is how you're able to populate the pool initially: provisioning clusters are effectively "running" for quota purposes; you need this throttle so the pool waits until the first cluster finishes and hibernates before starting the sixth provision, etc.

Now consider the steady state of a populated pool with no claimed clusters:
==> 5 hibernating, 2 running.
Three claims come in.
We satisfy two of them with the already-running clusters, and resume one of the hibernating ones for the third claim.
==> 4 hibernating, 3 running+claimed.
Now we try to replenish the pool. We need to add 3 clusters.
==> 4 hibernating, 3 running+claimed, 3 provisioning.
That's 3+3=6 clusters contributing to quota, so one of the new ones will fail to provision.

I'm not saying you shouldn't try to play this game; just that, even when you're pretty sure you've got the math right, there are non-crazy scenarios that will still blow up, and it's because the controller logic wasn't designed with overcommit in mind.

Quota usage from placement

Interesting side note, this text looks eerily familiar. I was a Placement core while I was working in upstream OpenStack, and it's possible I had a hand in authoring this, or at least reviewing it :)

@abraverm
Contributor

If this is a good heuristic for the timing of the shelving/unshelving process, then I'm fine with it. I was just
envisioning trying to shove a couple TB of disk images across a network to a storage device and thinking that might take long enough to kill the value vs spinning up a new cluster. Presumably the speed is dependent on having glance connected to particular types of storage -- more direct-attached-y and less network-y? If so, that seems like something worth calling out in the doc section you're adding here.

IIRC the snapshots were actually smaller than expected, i.e. not a 100% copy of the disk.

@msu8, please add info in the doc about the expected hibernation storage usage for a normal deployment.

I'm not saying you shouldn't try to play this game; just that, even when you're pretty sure you've got the math right, there are non-crazy scenarios that will still blow up, and it's because the controller logic wasn't designed with overcommit in mind.

We could add some queue for on-prem cloud credentials that would decide whether to deploy a cluster or restore one from hibernation, but let's call that future work and avoid over-engineering this feature.

@msu8, could you add a warning in the documentation about how Hive will try to restore a cluster from hibernation, and how this doesn't solve the problem of competing claims and ClusterPools?

@msu8
Author

msu8 commented Jul 31, 2025

Really appreciate your initial look at this @2uasimojo, and thanks for further clarifying some things @abraverm. I'm occupied with other projects right now, so I'll try to address the above issues as soon as possible (early next week).

@msu8
Author

msu8 commented Aug 11, 2025

Hi, I tried to address the feedback from the first review and shortened the client considerably, and hopefully fixed the vendor issue as well.

I got stuck a little bit in overengineering the actuator code - I added a two-phase snapshot deletion to free up as many resources as possible. But I'm not sure whether this is the correct way to do it, so I'm looking forward to more feedback.

The issue essentially is that if you create a basic cp (3 masters, 3 workers), the worker snapshots cannot be deleted, only the master instances can. This means that the “old” worker instances will be cleaned up in the next hibernation phase and new snapshots will be created.

However, I'm not too confident about the correctness of this solution, or whether it's even necessary, as it removes the possibility of manually recreating the master instances - if that's even desirable. At the same time, it minimizes resource use even further (although the snapshots don't count against the project's resources, the storage has to come from somewhere).

Also, I did not add more to the documentation, as I feel there'll be more work and changes on the doc side of things.

@abraverm
Contributor

The issue essentially is that if you create a basic cp (3 masters, 3 workers), the worker snapshots cannot be deleted, only the master instances can. This means that the “old” worker instances will be cleaned up in the next hibernation phase and new snapshots will be created.

Having two snapshots around for a short time is not ideal, but there is nothing you can do about it AFAIK. I think you should document this so the user won't be surprised that there is a surge of disk usage on OpenStack during hibernation.

Member

@2uasimojo 2uasimojo left a comment


Partial.

coresUsed = int(usage.TotalVCPUsUsage)
ramUsed = int(usage.TotalMemoryMBUsage)
}

// Check available resources
Member

We generally don't bother with preflight checks for quota usage. The reason is that many factors can affect quotas asynchronously: other clusters may be provisioning/deprovisioning, hibernating/resuming, etc. -- which may or may not even be happening/visible on this hub. Point being that you're not seriously reducing the probability of hitting a quota failure by doing this check.

I won't veto it if you really want to keep it, but I see it being mostly a waste of code, including the associated maintenance etc.

Contributor

@msu8, I recommend scratching this part, considering the impact.

@@ -28,53 +30,24 @@ type Client interface {
GetServer(ctx context.Context, serverID string) (*servers.Server, error)
Member

At a glance (heh) this interface looks a lot better now, thank you!

Member

[Later] I looked a bit deeper, still wanting to cut out some of the intermediary structs, see below.

return err
}

// Create snapshots for each instance
// 3.
Member

There doesn't seem to be a 2. 😇

@@ -199,9 +180,10 @@ func (a *openstackActuator) createSnapshots(openstackClient openstackclient.Clie
for i, server := range servers {
logger.Infof("creating snapshot %d/%d for instance %s", i+1, len(servers), server.Name)

snapshotID, err := openstackClient.CreateServerSnapshot(ctx, server.ID, server.Name)
snapshotName := fmt.Sprintf("%s-hibernation-%s", server.Name, time.Now().UTC().Format("20060102-150405"))
Member

Running time.Now() inside of the loop means we're likely to end up with different suffixes for snapshots for the different servers for the same hibernation event. Is that the desired outcome? I would have thought it would be better for debuggability if they matched.
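
For example (a sketch against the loop in the diff, assuming CreateServerSnapshot takes the generated name), hoisting the timestamp gives every snapshot from one hibernation event the same suffix:

```go
// One timestamp per hibernation event, shared by every snapshot name.
stamp := time.Now().UTC().Format("20060102-150405")
for i, server := range servers {
	logger.Infof("creating snapshot %d/%d for instance %s", i+1, len(servers), server.Name)
	snapshotName := fmt.Sprintf("%s-hibernation-%s", server.Name, stamp)
	snapshotID, err := openstackClient.CreateServerSnapshot(ctx, server.ID, snapshotName)
	// ... remainder of the loop unchanged
}
```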

Comment on lines 392 to 398
flavorCache := make(map[string]*flavors.Flavor) // Changed type

for _, instance := range instances {
requirements.Instances++

// Get flavor details
var flavor *openstackclient.FlavorDetails
var flavor *flavors.Flavor // Changed type
Member

Were these Changed type comments supposed to be permanent?

Comment on lines 1004 to 1008
// Create map of current snapshot IDs for exclusion
currentIDs := make(map[string]bool)
for _, id := range currentSnapshotIDs {
currentIDs[id] = true
}
Member

Looks like you may want to use sets from apimachinery. Lots of examples in the hive project.
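
Something like this, assuming the vendored apimachinery is recent enough for the generic sets package (k8s.io/apimachinery/pkg/util/sets); the field used for matching is illustrative:

```go
// Build the exclusion set once, then do O(1) membership checks.
currentIDs := sets.New[string](currentSnapshotIDs...)
for _, snapshot := range existingSnapshots {
	if currentIDs.Has(snapshot.ID) {
		continue // still referenced by the current hibernation event
	}
	// ... handle the orphaned snapshot
}
```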


// Filter snapshots
for _, snapshot := range existingSnapshots {
if strings.Contains(snapshot.Name, infraID) && strings.Contains(snapshot.Name, "hibernation") {
Member

Is this matching the names generated here? Is an infraID here the same as a server.Name there?

Perhaps it makes sense to make two (or more) funcs for snapshot names -- one to generate, another to match -- and put them right next to each other to make it easy for future-us to find, read, and correlate them.
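
A sketch of the paired helpers (the names and the exact matching rule are placeholders for whatever createSnapshots really produces):

```go
// hibernationSnapshotName builds the snapshot name for a server; keep it next
// to its matcher so the two can't drift apart.
func hibernationSnapshotName(serverName, timestamp string) string {
	return fmt.Sprintf("%s-hibernation-%s", serverName, timestamp)
}

// isHibernationSnapshotFor reports whether an image name was produced by
// hibernationSnapshotName for a server belonging to the given cluster.
func isHibernationSnapshotFor(imageName, infraID string) bool {
	return strings.HasPrefix(imageName, infraID) && strings.Contains(imageName, "-hibernation-")
}
```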

ctx := context.Background()

// List all images
existingSnapshots, err := openstackClient.ListImages(ctx, nil)
Member

Looks like an opportunity to make use of the ListOpts param to make ListImages() return only the relevant images, avoiding the need to filter below. You could implement a custom ListOptsBuilder for that... but it might be cooler if you tagged the snapshots with the infra ID (and/or a static hibernation string) when created; and then you could use the native Tags filter here.

[Later] Ugh, I just found out that you can't tag a snapshot on creation; you have to do it afterward. That's terrible, and pretty much kills the above idea, because we can't count on not getting interrupted between those two operations. We could declare failure if that happens; or we could make the whole thing idempotent; but then we still have to have logic to look up the images based on name so we can delete or tag orphans.

[Later later] GDI they really don't want to make it easy to prefilter these results, do they? I'm just worried about scaling here. An Image is a biggish struct, and there could potentially be quite a lot of them in a given cloud at a given time, and using no ListOpts and then AllPages() is guaranteed to give you all of them, which could be a nontrivial amount of memory.

I'm out of time for today, but lmk if you have any ideas for how to make this more efficient.
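
One partial mitigation for the memory side (a sketch only; it doesn't reduce API traffic, just the resident set): walk the pager page by page instead of materializing AllPages, keeping only the matches - here using a name matcher like the one sketched above.

```go
// Iterate pages from the image service rather than holding every Image at once.
// pagination is github.com/gophercloud/gophercloud/pagination.
var matches []images.Image
err := images.List(c.imageClient, images.ListOpts{}).EachPage(func(page pagination.Page) (bool, error) {
	imgs, err := images.ExtractImages(page)
	if err != nil {
		return false, err
	}
	for _, img := range imgs {
		if isHibernationSnapshotFor(img.Name, infraID) {
			matches = append(matches, img)
		}
	}
	return true, nil // keep paging
})
```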

Member

@2uasimojo 2uasimojo left a comment


Partial.

(Sorry, forgot to post this last night; and I've got to pause reviewing this for other priorities today.)

@@ -28,53 +30,24 @@ type Client interface {
GetServer(ctx context.Context, serverID string) (*servers.Server, error)
Member

[Later] I looked a bit deeper, still wanting to cut out some of the intermediary structs, see below.

Comment on lines 59 to 102
type Credentials struct {
AuthURL string `json:"auth_url"`
Username string `json:"username,omitempty"`
Password string `json:"password,omitempty"`
UserID string `json:"user_id,omitempty"`
ProjectID string `json:"project_id,omitempty"`
ProjectName string `json:"project_name,omitempty"`
UserDomainName string `json:"user_domain_name,omitempty"`
UserDomainID string `json:"user_domain_id,omitempty"`
ProjectDomainName string `json:"project_domain_name,omitempty"`
ProjectDomainID string `json:"project_domain_id,omitempty"`
RegionName string `json:"region_name,omitempty"`
Interface string `json:"interface,omitempty"`
IdentityAPIVersion string `json:"identity_api_version,omitempty"`

TenantID string `json:"tenant_id,omitempty"`
TenantName string `json:"tenant_name,omitempty"`
DomainID string `json:"domain_id,omitempty"`
DomainName string `json:"domain_name,omitempty"`
Region string `json:"region,omitempty"`
}

type CloudsYAML struct {
Clouds map[string]CloudConfig `yaml:"clouds"`
}

type CloudConfig struct {
Auth CloudAuth `yaml:"auth"`
Region string `yaml:"region_name"`
Interface string `yaml:"interface"`
Version string `yaml:"identity_api_version"`
}

type CloudAuth struct {
AuthURL string `yaml:"auth_url"`
Username string `yaml:"username"`
Password string `yaml:"password"`
ProjectID string `yaml:"project_id"`
ProjectName string `yaml:"project_name"`
UserDomainName string `yaml:"user_domain_name"`
ProjectDomainName string `yaml:"project_domain_name"`
UserDomainID string `yaml:"user_domain_id"`
ProjectDomainID string `yaml:"project_domain_id"`
}
Member

Is there a reason we can't get all of this from upstream?
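
If the upstream helper fits, something along these lines could replace the local structs - a sketch using the clientconfig types from github.com/gophercloud/utils, with the details unverified against the version vendored here:

```go
package openstackclient // hypothetical package name

import (
	"fmt"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/utils/openstack/clientconfig"
	yaml "gopkg.in/yaml.v2"
)

// authOptionsFromCloudsYAML parses the secret's clouds.yaml with upstream types
// instead of the local Credentials/CloudsYAML shims.
func authOptionsFromCloudsYAML(data []byte, cloudName string) (*gophercloud.AuthOptions, error) {
	var clouds clientconfig.Clouds
	if err := yaml.Unmarshal(data, &clouds); err != nil {
		return nil, err
	}
	cloud, ok := clouds.Clouds[cloudName]
	if !ok {
		return nil, fmt.Errorf("cloud %q not found in clouds.yaml", cloudName)
	}
	return clientconfig.AuthOptions(&clientconfig.ClientOpts{
		AuthInfo:   cloud.AuthInfo,
		RegionName: cloud.RegionName,
	})
}
```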

networkClient *gophercloud.ServiceClient
credentials *Credentials
}

Member

Good practice to include a type validation statement like

var _ Client = &openstackClient{}

This will go red if something changes such that openstackClient doesn't satisfy the Client interface.

func (a *openstackActuator) findInstancesByPrefix(openstackClient openstackclient.Client, prefix string) ([]ServerInfo, error) {
ctx := context.Background()

servers, err := openstackClient.ListServers(ctx, nil)
Member

Okay, so here I think we probably can make this more efficient by taking advantage of opts. Worth checking with the installer team, but usually they tag cloud resources associated with an OpenShift cluster (to make it easier to find them when deprovisioning). For example, AWS resources for an OpenShift cluster are all tagged with kubernetes.io/cluster/$infraID=owned.

And, nit, consider spelling it findInstancesByInfraID(..., infraID string)
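
A sketch of the server-side filter idea - Nova treats the name filter as a regular expression, and the tag filters need a recent enough compute microversion:

```go
// Ask Nova for only this cluster's instances instead of listing everything.
pages, err := servers.List(computeClient, servers.ListOpts{
	Name: "^" + infraID, // anchored prefix match, since Nova interprets this as a regex
	// Tags: "openshiftClusterID=" + infraID, // alternative, needs microversion >= 2.26
}).AllPages()
if err != nil {
	return nil, err
}
matching, err := servers.ExtractServers(pages)
```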

Comment on lines 668 to 672
// basic server information
type ServerInfo struct {
ID string
Name string
}
Member

I think I get why you did this -- Server itself is kind of big to be lugging around -- but since it seems unavoidable that we're going to be slurping the whole struct across the wire anyway, the only difference it makes is when it gets garbage collected. The duration of one reconcile loop should be short enough that we're not gaining anything by releasing it "early". As long as you're passing around *Server, you'll avoid making copies, and then IMO it makes sense to get rid of this type.

...unless I missed somewhere we're saving this struct to persistent storage?

// Build configuration for each instance
var instanceConfigs []OpenStackInstanceConfig
for i, serverInfo := range matchingServers {
serverDetails, err := openstackClient.GetServer(ctx, serverInfo.ID)
Member

Pursuant to comment, if we were passing around *Server instead of our reduced ServerInfo, this call would be unnecessary.

}

// Get security groups
secGroups, err := openstackClient.GetServerSecurityGroupNames(ctx, serverInfo.ID)
Member

Do these security groups persist when the instances are deleted? Can they still be looked up by server ID at that point? My intent: Either we need to recreate the SGs as part of the resume process; or as noted above we would be able to do this lookup on the resume side instead of here.

type openstackActuator struct {
openstackClientFn func(*hivev1.ClusterDeployment, client.Client, log.FieldLogger) (openstackclient.Client, error)
}

Member

Consider a type assertion

var _ HibernationActuator = &openstackActuator{}


// openstackActuator implements HibernationActuator for OpenStack
type openstackActuator struct {
openstackClientFn func(*hivev1.ClusterDeployment, client.Client, log.FieldLogger) (openstackclient.Client, error)
Member

Just an idea: rather than passing the client around as a param to every helper func, we could cache it in the actuator struct. It would go something like:

diff --git a/pkg/controller/hibernation/openstack_actuator.go b/pkg/controller/hibernation/openstack_actuator.go
index 50fbeeb7d..623d6efba 100644
--- a/pkg/controller/hibernation/openstack_actuator.go
+++ b/pkg/controller/hibernation/openstack_actuator.go
@@ -28,6 +28,16 @@ func init() {
 // openstackActuator implements HibernationActuator for OpenStack
 type openstackActuator struct {
        openstackClientFn func(*hivev1.ClusterDeployment, client.Client, log.FieldLogger) (openstackclient.Client, error)
+       openstackClient      openstackclient.Client
+}
+
+func (a *openstackActuator) cacheClient(cd *hivev1.ClusterDeployment, c client.Client, logger log.FieldLogger) error {
+       if a.openstackClient != nil {
+               return nil
+       }
+       var err error
+       a.openstackClient, err = a.openstackClientFn(cd, c, logger)
+       return err
 }
 
 // Create API client
@@ -63,7 +73,7 @@ func (a *openstackActuator) StopMachines(cd *hivev1.ClusterDeployment, hiveClien
        logger = logger.WithField("cloud", "openstack")
        logger.Info("stopping machines and creating snapshots")
 
-       openstackClient, err := a.openstackClientFn(cd, hiveClient, logger)
+       err := a.cacheClient(cd, hiveClient, logger)
        if err != nil {
                return fmt.Errorf("failed to create OpenStack client: %v", err)
        }
@@ -71,7 +81,7 @@ func (a *openstackActuator) StopMachines(cd *hivev1.ClusterDeployment, hiveClien
        infraID := cd.Spec.ClusterMetadata.InfraID
 
        // 1.
-       matchingServers, err := a.findInstancesByPrefix(openstackClient, infraID)
+       matchingServers, err := a.findInstancesByPrefix(infraID)
        if err != nil {
                return fmt.Errorf("error finding instances: %v", err)
        }
@@ -684,10 +694,10 @@ type OpenStackInstanceConfig struct {
 }
 
 // Return servers that match the infraID prefix
-func (a *openstackActuator) findInstancesByPrefix(openstackClient openstackclient.Client, prefix string) ([]ServerInfo, error) {
+func (a *openstackActuator) findInstancesByPrefix(prefix string) ([]ServerInfo, error) {
        ctx := context.Background()
 
-       servers, err := openstackClient.ListServers(ctx, nil)
+       servers, err := a.openstackClient.ListServers(ctx, nil)
        if err != nil {
                return nil, fmt.Errorf("error listing servers: %v", err)
        }

...and similar changes in the other helper funcs.

The tradeoff is tighter coupling: the helpers rely on that cache func having been called, or they'll panic. I'm not sure if it's worth the characters saved by eliminating that extra param 🤷

(I would have liked for all the helpers to call something more like a.getClient().DoWhatever(...), and have getClient() do the nil check and cache the client if necessary... but then there's no good way to handle the error from the client creation :( )


logger.Infof("found %d instances to hibernate", len(matchingServers))

if err := a.validateInstanceStates(openstackClient, matchingServers, logger); err != nil {
Member

Again here, I'm not sure it helps us to prevalidate. Since things can happen asynchronously, we would still need to have an error path for when we try to do the real thing and it fails because an instance is "invalid".

Also for idempotence, if we are interrupted partway through, is it possible to hit the next iteration and encounter one or more instances legitimately deleting, because we already snapshotted them last time? [Later] Hm, not as the algorithm is currently written... but I don't think that flow is idempotent...

This is sending me down a mental rabbit hole. Is it possible to somehow "freeze" or "lock" an instance so that it can't be touched by another process while we're working? If so, how would we be able to touch it in the aforementioned second-run-through scenario?

This question is actually relevant in terms of losing work as well. Scenario: we create a snapshot while some process is running on the instance. The process continues after our snapshot finishes, but before we delete the instance. When we restore, we'll go back to that old state, having lost whatever progress was made in the interim. Depending on the workload, that may or may not be recoverable.

Can we Pause() the instances as the first step?
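
For reference, a sketch of what that could look like with gophercloud's pauseunpause compute extension (variable names borrowed from the surrounding actuator code; the microversion caveat noted later in the thread still applies):

```go
// github.com/gophercloud/gophercloud/openstack/compute/v2/extensions/pauseunpause
for _, server := range matchingServers {
	// Freeze the guest before snapshotting so nothing changes on disk between
	// the snapshot completing and the instance being deleted.
	if err := pauseunpause.Pause(computeClient, server.ID).ExtractErr(); err != nil {
		return fmt.Errorf("failed to pause instance %s: %v", server.Name, err)
	}
}
// ...snapshot and delete as before; instances recreated from snapshots on resume
// start fresh, so no explicit Unpause is needed on that path.
```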

…shot naming, pausing

other fixes to logging, structure, comments
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 12, 2025
@openshift-merge-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@msu8
Author

msu8 commented Sep 12, 2025

Hi, thanks a lot for the amazing feedback, and sorry that the following commit took so much time - some things were a bit difficult for me and I unfortunately had to focus on other work.

I think I addressed all of the feedback except for a couple of points that I think aren't as necessary or would add friction/slowdown - I'll ask additional questions Monday and elaborate further. Sort of time-pressed.

  • Removed quota validation
  • Fixed timestamp consistency
  • Streamlined snapshot fetching by using already-known names
  • Removed redundant structs
  • Added client creation from upstream clientconfig
  • Added instance tagging, e.g.:

    {
      "name": "rd-ssegal-01-xczfk-worker-0-q4d76",
      "flavor": "66502c57-663a-4e51-82f0-bdbd9522ace0",
      "portID": "34f4f1f2-37b6-47a1-8558-98239b2e535b",
      "snapshotID": "04a3aaaf-8338-4b30-96de-2e1775a97c7f",
      "snapshotName": "rd-ssegal-01-xczfk-worker-0-q4d76-hibernation-20250912-155926",
      "securityGroups": [
        "rd-ssegal-01-xczfk-worker"
      ],
      "clusterID": "rd-ssegal-01-xczfk",
      "networkID": "f544f0d0-5051-4fd8-acdc-9edcad84f916",
      "openshiftClusterID": "rd-ssegal-01-xczfk",
      "metadata": {
        "Name": "rd-ssegal-01-xczfk-worker",
        "openshiftClusterID": "rd-ssegal-01-xczfk"
      },
      "tags": [                                  <------
        "cluster-api-provider-openstack",
        "openshift-machine-api-rd-ssegal-01-xczfk",
        "openshiftClusterID=rd-ssegal-01-xczfk"
      ]
    }
  • Added pausing/unpausing - pausing specifically requires a microversion to be specified, otherwise it doesn't work
  • Other smaller changes like renaming, structure, comments, logging
  • Updated docs

Contributor

openshift-ci bot commented Sep 12, 2025

@msu8: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-openstack | 541b6dd | link | true | /test e2e-openstack |
| ci/prow/images | 541b6dd | link | true | /test images |
| ci/prow/periodic-images | 541b6dd | link | true | /test periodic-images |
| ci/prow/coverage | 541b6dd | link | true | /test coverage |
| ci/prow/e2e-pool | 541b6dd | link | true | /test e2e-pool |
| ci/prow/verify | 541b6dd | link | true | /test verify |
| ci/prow/security | 541b6dd | link | true | /test security |
| ci/prow/unit | 541b6dd | link | true | /test unit |
| ci/prow/e2e | 541b6dd | link | true | /test e2e |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
