
Conversation


@msu8 msu8 commented Jul 28, 2025

Adds hibernation functionality for OpenStack clusters.

Hibernation is based on restoring instances from snapshots, freeing up all of the project's resources that were taken by the instances. Replicating true hibernation in OpenStack is difficult to impossible for us - e.g. shelving exists in OpenStack, but whether it releases the project's resources depends on how OpenStack is configured. This snapshot-based approach is universally compatible.

After a ClusterPool is created and claimed:

$ oc patch cd $cd -n $cd_ns --type='merge' -p $'spec:\n powerState: Hibernating'

Hibernation (sketched below):

  1. Find all cluster instances by infraID prefix
  2. Create snapshots of each instance
  3. Collect and save all instance attributes to a secret (CD ns):
    • Flavor IDs, port IDs, security groups
    • Instance metadata, networking config
    • Snapshot IDs for restoration
  4. Delete the actual instances (this frees up project's resources)
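
To make the flow concrete, here is a rough Go sketch of those steps against gophercloud's compute API - illustration only: the real actuator persists state to the secret, polls snapshot image status, and handles partial failures, all of which are omitted or stubbed here.

```go
// Sketch of the hibernation path; heavily simplified.
package hibernation

import (
	"fmt"
	"time"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/compute/v2/servers"
)

func hibernate(compute *gophercloud.ServiceClient, infraID string) error {
	// 1. Find all cluster instances by infraID prefix (Nova's name filter is a regex).
	pages, err := servers.List(compute, servers.ListOpts{Name: "^" + infraID}).AllPages()
	if err != nil {
		return err
	}
	instances, err := servers.ExtractServers(pages)
	if err != nil {
		return err
	}

	// 2. Snapshot every instance, sharing one timestamp per hibernation event.
	stamp := time.Now().UTC().Format("20060102-150405")
	snapshotIDs := map[string]string{}
	for _, s := range instances {
		id, err := servers.CreateImage(compute, s.ID, servers.CreateImageOpts{
			Name: fmt.Sprintf("%s-hibernation-%s", s.Name, stamp),
		}).ExtractImageID()
		if err != nil {
			return err
		}
		snapshotIDs[s.ID] = id
	}

	// 3. Save flavors, ports, security groups, metadata and snapshotIDs to the
	//    hibernation secret in the CD namespace (omitted here).

	// 4. Once every snapshot image has gone ACTIVE (polling omitted), delete the
	//    instances to release the project's quota.
	for _, s := range instances {
		if err := servers.Delete(compute, s.ID).ExtractErr(); err != nil {
			return err
		}
	}
	return nil
}
```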

$ oc patch cd $cd -n $cd_ns --type='merge' -p $'spec:\n powerState: Running'

Restoration (sketched below):

  1. Load saved config from secret
  2. Validate project has enough resources
  3. Recreate instances from snapshots
  4. Wait for instances to become ACTIVE
  5. Clean up hibernation secret
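
And the corresponding resume path, again as a sketch - the instanceConfig type below is a hypothetical stand-in for whatever the hibernation secret actually stores per instance.

```go
package hibernation

import (
	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/compute/v2/servers"
)

// instanceConfig is a placeholder for the per-instance data loaded from the secret.
type instanceConfig struct {
	Name, SnapshotID, FlavorRef, NetworkID, PortID string
	SecurityGroups                                 []string
	Metadata                                       map[string]string
}

func resume(compute *gophercloud.ServiceClient, configs []instanceConfig) error {
	for _, c := range configs {
		// Recreate the instance from its snapshot with the saved attributes.
		srv, err := servers.Create(compute, servers.CreateOpts{
			Name:           c.Name,
			ImageRef:       c.SnapshotID,
			FlavorRef:      c.FlavorRef,
			Networks:       []servers.Network{{UUID: c.NetworkID, Port: c.PortID}},
			SecurityGroups: c.SecurityGroups,
			Metadata:       c.Metadata,
		}).Extract()
		if err != nil {
			return err
		}
		// Block (here up to 10 minutes) until the instance reports ACTIVE.
		if err := servers.WaitForStatus(compute, srv.ID, "ACTIVE", 600); err != nil {
			return err
		}
	}
	// The snapshot images and the hibernation secret are cleaned up afterwards (omitted).
	return nil
}
```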

This approach still has many quirks, especially around the time it takes to save the snapshots and restore the instances, with the instances passing through different states along the way - the code is set up to work around those by frequently updating and checking the actual state of the instances.

Looking forward to feedback and more suggestions.

msu8 added 4 commits July 12, 2025 14:15
- openstack_actuator.go - hibernation based on snapshots/restoring
- openstack/client.go - minimal OpenstackClient to support "hibernation" operations
@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jul 28, 2025
Contributor

openshift-ci bot commented Jul 28, 2025

Hi @msu8. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci bot requested review from 2uasimojo and suhanime July 28, 2025 19:23
Contributor

openshift-ci bot commented Jul 28, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: msu8
Once this PR has been reviewed and has the lgtm label, please assign suhanime for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@2uasimojo
Member

/ok-to-test

I'll start having a look at this now.

About how long does each side of the operation take?

@openshift-ci openshift-ci bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 29, 2025
@2uasimojo
Member

2uasimojo commented Jul 29, 2025

I'd like to have a discussion about the different options available here. Specifically the relative resource savings provided by shelving vs. something like suspend, traded off with the time to perform the hibernate and resume operations.

It has been a minute, so I asked grok to refresh my memory. IIUC shelving:

  • depends on a glance setup (is that pretty much universal with all OpenStack deployments?)
  • frees up all the infra resources (CPU, memory, network, storage) at the expense of big blobs in glance
  • takes a long time because it's pushing/pulling instance disk images across the network

In contrast, suspend:

  • saves RAM state to the compute node for each instance (i.e. a lot less data, and local vs over the network)
  • frees CPU but retains memory and local storage reservations
  • still stops I/O

The architectural raison d'être of hibernation in a cloud scenario is to reduce infra cost while making it faster to resume than to provision. In AWS, the former is ~2/3 savings; and the latter is about an order of magnitude (~4-5m vs 45m). In particular, although it may be a side benefit in some setups, overcommitting is not a primary goal. That is, we're not trying to provide a way to create more clusters than you have resources for, on the theory that you would only ever be running a subset of them. The controls we provide on ClusterPools -- size/runningCount, maxSize and maxConcurrent -- are more designed for quota and throttling than for overcommitting. (It feels like you could plan for saturation based on maxSize - (size - runningCount), but that doesn't account for in-flight provisions that happen when a cluster is claimed from the pool -- those clusters are only hibernated once they have finished provisioning.) And importantly, hibernation in AWS entails stopping instances, but we're not deleting them. We're also not touching other infra pieces, which would still contribute to overall quota usage.

Putting this in context with shelve vs suspend: shelving results in a 100% resource savings (glance aside) at the expense of slow hibernate/resume, with no real guarantee of overcommit capability. Whereas suspend is much quicker (right?) at the cost of effectively enforcing no overcommit.

In conclusion, I feel like suspend aligns more closely with the philosophy for which hibernation was conceived. What are your thoughts? I would like to get @abraverm's input as well.

(BTW, it would have been neat to have this discussion, e.g. via an enhancement proposal, before you sunk all this work into the implementation...)

@2uasimojo
Member

2uasimojo commented Jul 29, 2025

Re build errors: it looks like you need to do go mod vendor; make update vendor. (We really should have it so the latter incorporates the former, but it's harder than it sounds to get everything done in the right order :( )

You'll have to explicitly check in the new vendored files from gophercloud.

Member

@2uasimojo 2uasimojo left a comment


Partial review. I only made it through the client.

I think perhaps we should discuss the design (see above) before spending more time on the code review process.

Also, please assign HIVE-2042 to yourself and reference it in this PR's title so it gets cross-linked.

Comment on lines 362 to 367

```bash
$ oc patch cd $cd -n $cd_ns --type='merge' -p $'spec:\n powerState: Hibernating'
$ oc patch cd $cd -n $cd_ns --type='merge' -p $'spec:\n powerState: Running'
```
Member

I think this chunk isn't needed.

  • Hibernation is used automatically for unclaimed clusters in ClusterPools whenever size > runningCount.
  • Hibernation isn't specific to ClusterPools -- it can be used for standalone ClusterDeployments.
  • These patch commands are already listed above (~L35).

GetFlavorDetails(ctx context.Context, flavorID string) (*FlavorDetails, error)
}

type Network struct {
Member

Can't we import these typedefs from gophercloud rather than shimming them?

Comment on lines 195 to 210
networks := []servers.Network{
{
UUID: opts.NetworkID,
Port: opts.PortID,
},
}

// Convert custom options to satisfy GopherCloud
createOpts := &servers.CreateOpts{
Name: opts.Name,
ImageRef: opts.ImageRef,
FlavorRef: opts.FlavorRef,
Networks: networks,
SecurityGroups: opts.SecurityGroups,
Metadata: opts.Metadata,
}
Member

...like here. What's the advantage of having our own ServerCreateOpts that we simply map to servers.CreateOpts in code, vs letting the caller construct the servers.CreateOpts out of effectively the same data?
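
To illustrate the suggestion (the CreateServer signature and the OpenStackInstanceConfig field names here are assumed, not copied from the PR): the actuator could build the upstream options itself and the client would just forward them.

```go
// Hypothetical: the Client interface takes servers.CreateOpts directly,
// so no local ServerCreateOpts shim is needed.
func (a *openstackActuator) recreateServer(ctx context.Context, c openstackclient.Client, cfg OpenStackInstanceConfig) (*servers.Server, error) {
	return c.CreateServer(ctx, servers.CreateOpts{
		Name:           cfg.Name,
		ImageRef:       cfg.SnapshotID,
		FlavorRef:      cfg.Flavor,
		Networks:       []servers.Network{{UUID: cfg.NetworkID, Port: cfg.PortID}},
		SecurityGroups: cfg.SecurityGroups,
		Metadata:       cfg.Metadata,
	})
}
```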

Comment on lines 164 to 165
server, err := servers.Get(c.computeClient, serverID).Extract()
return server, err
Member

nit, could just be

Suggested change
server, err := servers.Get(c.computeClient, serverID).Extract()
return server, err
return servers.Get(c.computeClient, serverID).Extract()

Comment on lines 222 to 223
image, err := images.Get(c.imageClient, imageID).Extract()
return image, err
Member

nit:

Suggested change
image, err := images.Get(c.imageClient, imageID).Extract()
return image, err
return images.Get(c.imageClient, imageID).Extract()

return nil, err
}

portList, err := ports.ExtractPorts(allPages)
Member

Another example: why would we need to repackage the resulting []ports.Port into our own []Port rather than just

Suggested change
portList, err := ports.ExtractPorts(allPages)
return ports.ExtractPorts(allPages)

?

}

// Get the security group names for a specific server
func (c *openstackClient) GetServerSecurityGroups(ctx context.Context, serverID string) ([]string, error) {
Member

If it is truly desirable for this func to return a list of names rather than a list of security groups, suggest naming it

Suggested change
func (c *openstackClient) GetServerSecurityGroups(ctx context.Context, serverID string) ([]string, error) {
func (c *openstackClient) GetServerSecurityGroupNames(ctx context.Context, serverID string) ([]string, error) {

instead.

Comment on lines 436 to 456
if creds.TenantID != "" {
authOpts.TenantID = creds.TenantID
}
if creds.TenantName != "" {
authOpts.TenantName = creds.TenantName
}
if creds.DomainID != "" {
authOpts.DomainID = creds.DomainID
}
if creds.DomainName != "" {
authOpts.DomainName = creds.DomainName
}
if creds.UserDomainID != "" {
authOpts.DomainID = creds.UserDomainID
}
if creds.ProjectDomainID != "" {
authOpts.DomainID = creds.ProjectDomainID
}
if creds.ProjectDomainName != "" {
authOpts.DomainName = creds.ProjectDomainName
}
Member

I think all of these conditionals are redundant. If the creds.* field is the empty string, you're just avoiding setting the corresponding authOpts.* field to the empty string... which is already its default (zero) value.

...but again, this is only relevant if we continue to see the need for a shim struct rather than just letting the caller use the upstream gophercloud.AuthOptions directly.
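
A sketch of the simplification being suggested (the field mapping is illustrative): plain assignment produces the same result, because copying an empty string over a field that already holds its zero value changes nothing.

```go
// No conditionals needed - empty creds fields simply leave the corresponding
// authOpts fields at their zero value.
authOpts := gophercloud.AuthOptions{
	IdentityEndpoint: creds.AuthURL,
	Username:         creds.Username,
	Password:         creds.Password,
	TenantID:         creds.TenantID,
	TenantName:       creds.TenantName,
	DomainID:         creds.DomainID,
	DomainName:       creds.DomainName,
}
```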


// Create a client wrapper object for interacting with OpenStack.
func NewClientFromSecret(secret *corev1.Secret) (Client, error) {
cloudsYaml, ok := secret.Data["clouds.yaml"]
Member

nit: we have a const for this.

@abraverm
Contributor

I'd like to have a discussion about the different options available here. Specifically the relative resource savings provided by shelving vs. something like suspend, traded off with the time to perform the hibernate and resume operations.

It has been a minute, so I asked grok to refresh my memory. IIUC shelving:

  • depends on a glance setup (is that pretty much universal with all OpenStack deployments?)
  • frees up all the infra resources (CPU, memory, network, storage) at the expense of big blobs in glance
  • takes a long time because it's pushing/pulling instance disk images across the network

In contrast, suspend:

  • saves RAM state to the compute node for each instance (i.e. a lot less data, and local vs over the network)
  • frees CPU but retains memory and local storage reservations
  • still stops I/O

The architectural raison d'être of hibernation in a cloud scenario is to reduce infra cost while making it faster to resume than to provision. In AWS, the former is ~2/3 savings; and the latter is about an order of magnitude (~4-5m vs 45m). In particular, although it may be a side benefit in some setups, overcommitting is not a primary goal. That is, we're not trying to provide a way to create more clusters than you have resources for, on the theory that you would only ever be running a subset of them. The controls we provide on ClusterPools -- size/runningCount, maxSize and maxConcurrent -- are more designed for quota and throttling than for overcommitting. (It feels like you could plan for saturation based on maxSize - (size - runningCount), but that doesn't account for in-flight provisions that happen when a cluster is claimed from the pool -- those clusters are only hibernated once they have finished provisioning.) And importantly, hibernation in AWS entails stopping instances, but we're not deleting them. We're also not touching other infra pieces, which would still contribute to overall quota usage.

Putting this in context with shelve vs suspend: shelving results in a 100% resource savings (glance aside) at the expense of slow hibernate/resume, with no real guarantee of overcommit capability. Whereas suspend is much quicker (right?) at the cost of effectively enforcing no overcommit.

In conclusion, I feel like suspend aligns more closely with the philosophy for which hibernation was conceived. What are your thoughts? I would like to get @abraverm's input as well.

The main purpose of hibernation in on-premise clouds is to increase the number of clusters, since resources are limited and disk is cheaper than memory and CPU. That is, having more clusters in hibernation, ready to be started, saves the time it takes to deploy a new cluster (about ~10 min vs. 1 hour). In OpenStack, each project (tenant) usually has a resource quota. Suspending an instance (VM) doesn't reduce the resource usage. Shelving could reduce the resource usage, but it requires additional configuration - quota usage from Placement - which in our case caused problems and we had to revert. This is why we ended up with this design.

(BTW, it would have been neat to have this discussion, e.g. via an enhancement proposal, before you sunk all this work into the implementation...)

Sure thing.

@2uasimojo
Member

(about ~10 min vs 1 hour).

If this is a good heuristic for the timing of the shelving/unshelving process, then I'm fine with it. I was just envisioning trying to shove a couple TB of disk images across a network to a storage device and thinking that might take long enough to kill the value vs spinning up a new cluster. Presumably the speed is dependent on having glance connected to particular types of storage -- more direct-attached-y and less network-y? If so, that seems like something worth calling out in the doc section you're adding here.

The main purpose of hibernation in on premise clouds is to increase number of clusters

I'll defer to your experience on this point, but I'll reiterate that hive seems to have been designed (before my time) such that overcommit capability was not a primary goal. Let me elaborate on the example I mentioned above:

(It feels like you could plan for saturation based on maxSize - (size - runningCount), but that doesn't account for in-flight provisions that happen when a cluster is claimed from the pool -- those clusters are only hibernated once they have finished provisioning.)

So simplistically let's say you have 1 CPU per cluster and a quota of 5 CPU. You think you can overcommit with the following settings:

size: 7
runningCount: 2
maxSize: 10
maxConcurrent: 5
  • maxSize - size = 10 - 7 = 3 -- this is the max number of claimed clusters at once.
  • Balanced with runningCount = 2, that's your 5 total running clusters to saturate your CPU quota.
  • size - runningCount = 7 - 2 = 5 -- this is your overcommit, which is okay because they're hibernating and don't affect quota.
  • maxConcurrent = 5 is how you're able to populate the pool initially: provisioning clusters are effectively "running" for quota purposes; you need this throttle so the pool waits until the first cluster finishes and hibernates before starting the sixth provision, etc.

Now consider the steady state of a populated pool with no claimed clusters:
==> 5 hibernating, 2 running.
Three claims come in.
We satisfy two of them with the already-running clusters, and resume one of the hibernating ones for the third claim.
==> 4 hibernating, 3 running+claimed.
Now we try to replenish the pool. We need to add 3 clusters.
==> 4 hibernating, 3 running+claimed, 3 provisioning.
That's 3+3=6 clusters contributing to quota, so one of the new ones will fail to provision.

I'm not saying you shouldn't try to play this game; just that, even when you're pretty sure you've got the math right, there are non-crazy scenarios that will still blow up, and it's because the controller logic wasn't designed with overcommit in mind.

Quota usage from placement

Interesting side note, this text looks eerily familiar. I was a Placement core while I was working in upstream OpenStack, and it's possible I had a hand in authoring this, or at least reviewing it :)

@abraverm
Contributor

If this is a good heuristic for the timing of the shelving/unshelving process, then I'm fine with it. I was just
envisioning trying to shove a couple TB of disk images across a network to a storage device and thinking that might take long enough to kill the value vs spinning up a new cluster. Presumably the speed is dependent on having glance connected to particular types of storage -- more direct-attached-y and less network-y? If so, that seems like something worth calling out in the doc section you're adding here.

IIRC the snapshots were actually smaller than expected, i.e. not a 100% copy of the disk.

@msu8, please add info in the doc about the expected hibernation storage usage for a normal deployment.

I'm not saying you shouldn't try to play this game; just that, even when you're pretty sure you've got the math right, there are non-crazy scenarios that will still blow up, and it's because the controller logic wasn't designed with overcommit in mind.

We could add some queue for on-prem cloud credentials that would decide whether to deploy a cluster or restore one from hibernation, but let's call that future work and avoid over-engineering this feature.

@msu8, could you add a warning in the documentation about how Hive will try to restore a cluster from hibernation, and how this doesn't solve the problem of competing claims and ClusterPools?

@msu8
Author

msu8 commented Jul 31, 2025

Really appreciate your initial look at this @2uasimojo, and thanks for further clarifying some things @abraverm. I'm occupied with other projects right now, so I'll try to address the above issues as soon as possible (early next week).

@msu8
Author

msu8 commented Aug 11, 2025

Hi, I tried to address the feedback from the first review and shortened the client considerably, and hopefully fixed the vendor issue as well.

I got stuck a little bit in overengineering the actuator code - I added a two-phase snapshot deletion to free up as many resources as possible. But I'm not sure whether this is the correct way to do it, so I'm looking forward to more feedback.

The issue essentially is that if you create a basic cp (3 masters, 3 workers), the worker snapshots cannot be deleted, only the master instances can. This means that the “old” worker instances will be cleaned up in the next hibernation phase and new snapshots will be created.

However, I'm not too confident about the correctness of this solution, or whether it's even necessary, as it removes the possibility of manually recreating the master instances - if that's even desirable. At the same time, it minimizes resource use even further (although the snapshots don't count against the project's resources, the storage has to come from somewhere).

Also, I did not add more to the documentation, as I feel there'll be more work and changes on the doc side of things.

@abraverm
Contributor

The issue essentially is that if you create a basic cp (3 masters, 3 workers), the worker snapshots cannot be deleted, only the master instances can. This means that the “old” worker instances will be cleaned up in the next hibernation phase and new snapshots will be created.

Having two snapshots around for a short time is not ideal, but there is nothing you can do about it AFAIK. I think you should document this so the user won't be surprised that there is a surge of disk usage on OpenStack during hibernation.

Member

@2uasimojo 2uasimojo left a comment


Partial.

coresUsed = int(usage.TotalVCPUsUsage)
ramUsed = int(usage.TotalMemoryMBUsage)
}

// Check available resources
Member

We generally don't bother with preflight checks for quota usage. The reason is that many factors can affect quotas asynchronously: other clusters may be provisioning/deprovisioning, hibernating/resuming, etc. -- which may or may not even be happening/visible on this hub. Point being that you're not seriously reducing the probability of hitting a quota failure by doing this check.

I won't veto it if you really want to keep it, but I see it being mostly a waste of code, including the associated maintenance etc.

Contributor

@msu8, I recommend scratching this part, considering the impact.

@@ -28,53 +30,24 @@ type Client interface {
GetServer(ctx context.Context, serverID string) (*servers.Server, error)
Member

At a glance (heh) this interface looks a lot better now, thank you!

Member

[Later] I looked a bit deeper, still wanting to cut out some of the intermediary structs, see below.

return err
}

// Create snapshots for each instance
// 3.
Member

There doesn't seem to be a 2. 😇

@@ -199,9 +180,10 @@ func (a *openstackActuator) createSnapshots(openstackClient openstackclient.Clie
for i, server := range servers {
logger.Infof("creating snapshot %d/%d for instance %s", i+1, len(servers), server.Name)

snapshotID, err := openstackClient.CreateServerSnapshot(ctx, server.ID, server.Name)
snapshotName := fmt.Sprintf("%s-hibernation-%s", server.Name, time.Now().UTC().Format("20060102-150405"))
Member

Running time.Now() inside of the loop means we're likely to end up with different suffixes for snapshots for the different servers for the same hibernation event. Is that the desired outcome? I would have thought it would be better for debuggability if they matched.
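
For example (a sketch against the loop in the diff, assuming CreateServerSnapshot takes the generated name), hoisting the timestamp gives every snapshot from one hibernation event the same suffix:

```go
// One timestamp per hibernation event, shared by every snapshot name.
stamp := time.Now().UTC().Format("20060102-150405")
for i, server := range servers {
	logger.Infof("creating snapshot %d/%d for instance %s", i+1, len(servers), server.Name)
	snapshotName := fmt.Sprintf("%s-hibernation-%s", server.Name, stamp)
	snapshotID, err := openstackClient.CreateServerSnapshot(ctx, server.ID, snapshotName)
	// ... remainder of the loop unchanged
}
```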

Comment on lines 392 to 398
flavorCache := make(map[string]*flavors.Flavor) // Changed type

for _, instance := range instances {
requirements.Instances++

// Get flavor details
var flavor *openstackclient.FlavorDetails
var flavor *flavors.Flavor // Changed type
Member

Were these Changed type comments supposed to be permanent?

Comment on lines 1004 to 1008
// Create map of current snapshot IDs for exclusion
currentIDs := make(map[string]bool)
for _, id := range currentSnapshotIDs {
currentIDs[id] = true
}
Member

Looks like you may want to use sets from apimachinery. Lots of examples in the hive project.
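
Something like this, assuming the vendored apimachinery is recent enough for the generic sets package (k8s.io/apimachinery/pkg/util/sets); the field used for matching is illustrative:

```go
// Build the exclusion set once, then do O(1) membership checks.
currentIDs := sets.New[string](currentSnapshotIDs...)
for _, snapshot := range existingSnapshots {
	if currentIDs.Has(snapshot.ID) {
		continue // still referenced by the current hibernation event
	}
	// ... handle the orphaned snapshot
}
```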


// Filter snapshots
for _, snapshot := range existingSnapshots {
if strings.Contains(snapshot.Name, infraID) && strings.Contains(snapshot.Name, "hibernation") {
Member

Is this matching the names generated here? Is an infraID here the same as a server.Name there?

Perhaps it makes sense to make two (or more) funcs for snapshot names -- one to generate, another to match -- and put them right next to each other to make it easy for future-us to find, read, and correlate them.
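
A sketch of the paired helpers (the names and the exact matching rule are placeholders for whatever createSnapshots really produces):

```go
// hibernationSnapshotName builds the snapshot name for a server; keep it next
// to its matcher so the two can't drift apart.
func hibernationSnapshotName(serverName, timestamp string) string {
	return fmt.Sprintf("%s-hibernation-%s", serverName, timestamp)
}

// isHibernationSnapshotFor reports whether an image name was produced by
// hibernationSnapshotName for a server belonging to the given cluster.
func isHibernationSnapshotFor(imageName, infraID string) bool {
	return strings.HasPrefix(imageName, infraID) && strings.Contains(imageName, "-hibernation-")
}
```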

ctx := context.Background()

// List all images
existingSnapshots, err := openstackClient.ListImages(ctx, nil)
Member

Looks like an opportunity to make use of the ListOpts param to make ListImages() return only the relevant images, avoiding the need to filter below. You could implement a custom ListOptsBuilder for that... but it might be cooler if you tagged the snapshots with the infra ID (and/or a static hibernation string) when created; and then you could use the native Tags filter here.

[Later] Ugh, I just found out that you can't tag a snapshot on creation; you have to do it afterward. That's terrible, and pretty much kills the above idea, because we can't count on not getting interrupted between those two operations. We could declare failure if that happens; or we could make the whole thing idempotent; but then we still have to have logic to look up the images based on name so we can delete or tag orphans.

[Later later] GDI they really don't want to make it easy to prefilter these results, do they? I'm just worried about scaling here. An Image is a biggish struct, and there could potentially be quite a lot of them in a given cloud at a given time, and using no ListOpts and then AllPages() is guaranteed to give you all of them, which could be a nontrivial amount of memory.

I'm out of time for today, but lmk if you have any ideas for how to make this more efficient.
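
One partial mitigation for the memory side (a sketch only; it doesn't reduce API traffic, just the resident set): walk the pager page by page instead of materializing AllPages, keeping only the matches - here using a name matcher like the one sketched above.

```go
// Iterate pages from the image service rather than holding every Image at once.
// pagination is github.com/gophercloud/gophercloud/pagination.
var matches []images.Image
err := images.List(c.imageClient, images.ListOpts{}).EachPage(func(page pagination.Page) (bool, error) {
	imgs, err := images.ExtractImages(page)
	if err != nil {
		return false, err
	}
	for _, img := range imgs {
		if isHibernationSnapshotFor(img.Name, infraID) {
			matches = append(matches, img)
		}
	}
	return true, nil // keep paging
})
```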

Member

@2uasimojo 2uasimojo left a comment


Partial.

(Sorry, forgot to post this last night; and I've got to pause reviewing this for other priorities today.)

@@ -28,53 +30,24 @@ type Client interface {
GetServer(ctx context.Context, serverID string) (*servers.Server, error)
Member

[Later] I looked a bit deeper, still wanting to cut out some of the intermediary structs, see below.

Comment on lines 59 to 102
type Credentials struct {
AuthURL string `json:"auth_url"`
Username string `json:"username,omitempty"`
Password string `json:"password,omitempty"`
UserID string `json:"user_id,omitempty"`
ProjectID string `json:"project_id,omitempty"`
ProjectName string `json:"project_name,omitempty"`
UserDomainName string `json:"user_domain_name,omitempty"`
UserDomainID string `json:"user_domain_id,omitempty"`
ProjectDomainName string `json:"project_domain_name,omitempty"`
ProjectDomainID string `json:"project_domain_id,omitempty"`
RegionName string `json:"region_name,omitempty"`
Interface string `json:"interface,omitempty"`
IdentityAPIVersion string `json:"identity_api_version,omitempty"`

TenantID string `json:"tenant_id,omitempty"`
TenantName string `json:"tenant_name,omitempty"`
DomainID string `json:"domain_id,omitempty"`
DomainName string `json:"domain_name,omitempty"`
Region string `json:"region,omitempty"`
}

type CloudsYAML struct {
Clouds map[string]CloudConfig `yaml:"clouds"`
}

type CloudConfig struct {
Auth CloudAuth `yaml:"auth"`
Region string `yaml:"region_name"`
Interface string `yaml:"interface"`
Version string `yaml:"identity_api_version"`
}

type CloudAuth struct {
AuthURL string `yaml:"auth_url"`
Username string `yaml:"username"`
Password string `yaml:"password"`
ProjectID string `yaml:"project_id"`
ProjectName string `yaml:"project_name"`
UserDomainName string `yaml:"user_domain_name"`
ProjectDomainName string `yaml:"project_domain_name"`
UserDomainID string `yaml:"user_domain_id"`
ProjectDomainID string `yaml:"project_domain_id"`
}
Member

Is there a reason we can't get all of this from upstream?
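
If the upstream helper fits, something along these lines could replace the local structs - a sketch using the clientconfig types from github.com/gophercloud/utils, with the details unverified against the version vendored here:

```go
package openstackclient // hypothetical package name

import (
	"fmt"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/utils/openstack/clientconfig"
	yaml "gopkg.in/yaml.v2"
)

// authOptionsFromCloudsYAML parses the secret's clouds.yaml with upstream types
// instead of the local Credentials/CloudsYAML shims.
func authOptionsFromCloudsYAML(data []byte, cloudName string) (*gophercloud.AuthOptions, error) {
	var clouds clientconfig.Clouds
	if err := yaml.Unmarshal(data, &clouds); err != nil {
		return nil, err
	}
	cloud, ok := clouds.Clouds[cloudName]
	if !ok {
		return nil, fmt.Errorf("cloud %q not found in clouds.yaml", cloudName)
	}
	return clientconfig.AuthOptions(&clientconfig.ClientOpts{
		AuthInfo:   cloud.AuthInfo,
		RegionName: cloud.RegionName,
	})
}
```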

networkClient *gophercloud.ServiceClient
credentials *Credentials
}

Member

Good practice to include a type validation statement like

var _ Client = &openstackClient{}

This will go red if something changes such that openstackClient doesn't satisfy the Client interface.

func (a *openstackActuator) findInstancesByPrefix(openstackClient openstackclient.Client, prefix string) ([]ServerInfo, error) {
ctx := context.Background()

servers, err := openstackClient.ListServers(ctx, nil)
Member

Okay, so here I think we probably can make this more efficient by taking advantage of opts. Worth checking with the installer team, but usually they tag cloud resources associated with an OpenShift cluster (to make it easier to find them when deprovisioning). For example, AWS resources for an OpenShift cluster are all tagged with kubernetes.io/cluster/$infraID=owned.

And, nit, consider spelling it findInstancesByInfraID(..., infraID string)
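
A sketch of the server-side filter idea - Nova treats the name filter as a regular expression, and the tag filters need a recent enough compute microversion:

```go
// Ask Nova for only this cluster's instances instead of listing everything.
pages, err := servers.List(computeClient, servers.ListOpts{
	Name: "^" + infraID, // anchored prefix match, since Nova interprets this as a regex
	// Tags: "openshiftClusterID=" + infraID, // alternative, needs microversion >= 2.26
}).AllPages()
if err != nil {
	return nil, err
}
matching, err := servers.ExtractServers(pages)
```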

Comment on lines 668 to 672
// basic server information
type ServerInfo struct {
ID string
Name string
}
Member

I think I get why you did this -- Server itself is kind of big to be lugging around -- but since it seems unavoidable that we're going to be slurping the whole struct across the wire anyway, the only difference it makes is when it gets garbage collected. The duration of one reconcile loop should be short enough that we're not gaining anything by releasing it "early". As long as you're passing around *Server, you'll avoid making copies, and then IMO it makes sense to get rid of this type.

...unless I missed somewhere we're saving this struct to persistent storage?

// Build configuration for each instance
var instanceConfigs []OpenStackInstanceConfig
for i, serverInfo := range matchingServers {
serverDetails, err := openstackClient.GetServer(ctx, serverInfo.ID)
Member

Pursuant to comment, if we were passing around *Server instead of our reduced ServerInfo, this call would be unnecessary.

}

// Get security groups
secGroups, err := openstackClient.GetServerSecurityGroupNames(ctx, serverInfo.ID)
Member

Do these security groups persist when the instances are deleted? Can they still be looked up by server ID at that point? My intent: Either we need to recreate the SGs as part of the resume process; or as noted above we would be able to do this lookup on the resume side instead of here.

type openstackActuator struct {
openstackClientFn func(*hivev1.ClusterDeployment, client.Client, log.FieldLogger) (openstackclient.Client, error)
}

Member

Consider a type assertion

var _ HibernationActuator = &openstackActuator{}


// openstackActuator implements HibernationActuator for OpenStack
type openstackActuator struct {
openstackClientFn func(*hivev1.ClusterDeployment, client.Client, log.FieldLogger) (openstackclient.Client, error)
Member

Just an idea: rather than passing the client around as a param to every helper func, we could cache it in the actuator struct. It would go something like:

diff --git a/pkg/controller/hibernation/openstack_actuator.go b/pkg/controller/hibernation/openstack_actuator.go
index 50fbeeb7d..623d6efba 100644
--- a/pkg/controller/hibernation/openstack_actuator.go
+++ b/pkg/controller/hibernation/openstack_actuator.go
@@ -28,6 +28,16 @@ func init() {
 // openstackActuator implements HibernationActuator for OpenStack
 type openstackActuator struct {
        openstackClientFn func(*hivev1.ClusterDeployment, client.Client, log.FieldLogger) (openstackclient.Client, error)
+       openstackClient      openstackclient.Client
+}
+
+func (a *openstackActuator) cacheClient(cd *hivev1.ClusterDeployment, c client.Client, logger log.FieldLogger) error {
+       if a.openstackClient != nil {
+               return nil
+       }
+       var err error
+       a.openstackClient, err = a.openstackClientFn(cd, c, logger)
+       return err
 }
 
 // Create API client
@@ -63,7 +73,7 @@ func (a *openstackActuator) StopMachines(cd *hivev1.ClusterDeployment, hiveClien
        logger = logger.WithField("cloud", "openstack")
        logger.Info("stopping machines and creating snapshots")
 
-       openstackClient, err := a.openstackClientFn(cd, hiveClient, logger)
+       err := a.cacheClient(cd, hiveClient, logger)
        if err != nil {
                return fmt.Errorf("failed to create OpenStack client: %v", err)
        }
@@ -71,7 +81,7 @@ func (a *openstackActuator) StopMachines(cd *hivev1.ClusterDeployment, hiveClien
        infraID := cd.Spec.ClusterMetadata.InfraID
 
        // 1.
-       matchingServers, err := a.findInstancesByPrefix(openstackClient, infraID)
+       matchingServers, err := a.findInstancesByPrefix(infraID)
        if err != nil {
                return fmt.Errorf("error finding instances: %v", err)
        }
@@ -684,10 +694,10 @@ type OpenStackInstanceConfig struct {
 }
 
 // Return servers that match the infraID prefix
-func (a *openstackActuator) findInstancesByPrefix(openstackClient openstackclient.Client, prefix string) ([]ServerInfo, error) {
+func (a *openstackActuator) findInstancesByPrefix(prefix string) ([]ServerInfo, error) {
        ctx := context.Background()
 
-       servers, err := openstackClient.ListServers(ctx, nil)
+       servers, err := a.openstackClient.ListServers(ctx, nil)
        if err != nil {
                return nil, fmt.Errorf("error listing servers: %v", err)
        }

...and similar changes in the other helper funcs.

The tradeoff is tighter coupling: the helpers rely on that cache func having been called, or they'll panic. I'm not sure if it's worth the characters saved by eliminating that extra param 🤷

(I would have liked for all the helpers to call something more like a.getClient().DoWhatever(...), and have getClient() do the nil check and cache the client if necessary... but then there's no good way to handle the error from the client creation :( )


logger.Infof("found %d instances to hibernate", len(matchingServers))

if err := a.validateInstanceStates(openstackClient, matchingServers, logger); err != nil {
Member

Again here, I'm not sure it helps us to prevalidate. Since things can happen asynchronously, we would still need to have an error path for when we try to do the real thing and it fails because an instance is "invalid".

Also for idempotence, if we are interrupted partway through, is it possible to hit the next iteration and encounter one or more instances legitimately deleting, because we already snapshotted them last time? [Later] Hm, not as the algorithm is currently written... but I don't think that flow is idempotent...

This is sending me down a mental rabbit hole. Is it possible to somehow "freeze" or "lock" an instance so that it can't be touched by another process while we're working? If so, how would we be able to touch it in the aforementioned second-run-through scenario?

This question is actually relevant in terms of losing work as well. Scenario: we create a snapshot while some process is running on the instance. The process continues after our snapshot finishes, but before we delete the instance. When we restore, we'll go back to that old state, having lost whatever progress was made in the interim. Depending on the workload, that may or may not be recoverable.

Can we Pause() the instances as the first step?
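
For reference, a sketch of what that could look like with gophercloud's pauseunpause compute extension (variable names borrowed from the surrounding actuator code; the microversion caveat noted later in the thread still applies):

```go
// github.com/gophercloud/gophercloud/openstack/compute/v2/extensions/pauseunpause
for _, server := range matchingServers {
	// Freeze the guest before snapshotting so nothing changes on disk between
	// the snapshot completing and the instance being deleted.
	if err := pauseunpause.Pause(computeClient, server.ID).ExtractErr(); err != nil {
		return fmt.Errorf("failed to pause instance %s: %v", server.Name, err)
	}
}
// ...snapshot and delete as before; instances recreated from snapshots on resume
// start fresh, so no explicit Unpause is needed on that path.
```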

…shot naming, pausing

other fixes to logging, structure, comments
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 12, 2025
@openshift-merge-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@msu8
Author

msu8 commented Sep 12, 2025

Hi, thanks a lot for the amazing feedback, and sorry that the following commit took so much time - some things were a bit difficult for me and I unfortunately had to focus on other work.

I think I addressed all of the feedback except for a couple of points that I think aren't as necessary or would add friction/slowdown - I'll ask additional questions Monday and elaborate further. Sort of time-pressed.

  • Removed quota validation
  • Fixed timestamp consistency
  • Streamlined snapshot fetching by using already-known names
  • Removed redundant structs
  • Added client creation from upstream clientconfig
  • Added instance tagging, e.g.:

    {
      "name": "rd-ssegal-01-xczfk-worker-0-q4d76",
      "flavor": "66502c57-663a-4e51-82f0-bdbd9522ace0",
      "portID": "34f4f1f2-37b6-47a1-8558-98239b2e535b",
      "snapshotID": "04a3aaaf-8338-4b30-96de-2e1775a97c7f",
      "snapshotName": "rd-ssegal-01-xczfk-worker-0-q4d76-hibernation-20250912-155926",
      "securityGroups": [
        "rd-ssegal-01-xczfk-worker"
      ],
      "clusterID": "rd-ssegal-01-xczfk",
      "networkID": "f544f0d0-5051-4fd8-acdc-9edcad84f916",
      "openshiftClusterID": "rd-ssegal-01-xczfk",
      "metadata": {
        "Name": "rd-ssegal-01-xczfk-worker",
        "openshiftClusterID": "rd-ssegal-01-xczfk"
      },
      "tags": [                                  <------
        "cluster-api-provider-openstack",
        "openshift-machine-api-rd-ssegal-01-xczfk",
        "openshiftClusterID=rd-ssegal-01-xczfk"
      ]
    }
  • Added pausing/unpausing - pausing specifically requires a microversion to be specified, otherwise it doesn't work
  • Other smaller changes like renaming, structure, comments, logging
  • Updated docs

Contributor

openshift-ci bot commented Sep 12, 2025

@msu8: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-openstack | 541b6dd | link | true | /test e2e-openstack |
| ci/prow/images | 541b6dd | link | true | /test images |
| ci/prow/periodic-images | 541b6dd | link | true | /test periodic-images |
| ci/prow/coverage | 541b6dd | link | true | /test coverage |
| ci/prow/e2e-pool | 541b6dd | link | true | /test e2e-pool |
| ci/prow/verify | 541b6dd | link | true | /test verify |
| ci/prow/security | 541b6dd | link | true | /test security |
| ci/prow/unit | 541b6dd | link | true | /test unit |
| ci/prow/e2e | 541b6dd | link | true | /test e2e |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
