OpenStack Hibernation #2722
base: master
Conversation
- openstack_actuator.go: hibernation based on snapshots/restoring
- openstack/client.go: minimal OpenstackClient to support "hibernation" operations
Hi @msu8. Thanks for your PR. I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: msu8

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/ok-to-test

I'll start having a look at this now. About how long does each side of the operation take?
I'd like to have a discussion about the different options available here. Specifically the relative resource savings provided by shelving vs. something like suspend, traded off with the time to perform the hibernate and resume operations. It has been a minute, so I asked grok to refresh my memory. IIUC shelving:
In contrast, suspend:
The architectural raison d'être of hibernation in a cloud scenario is to reduce infra cost while making it faster to resume than to provision. In AWS, the former is ~2/3 savings; and the latter is about an order of magnitude (~4-5m vs 45m). In particular, although it may be a side benefit in some setups, overcommitting is not a primary goal. That is, we're not trying to provide a way to create more clusters than you have resources for, on the theory that you would only ever be running a subset of them. The controls we provide on ClusterPools --

Putting this in context with shelve vs suspend: shelving results in a 100% resource savings (glance aside) at the expense of slow hibernate/resume, with no real guarantee of overcommit capability. Whereas suspend is much quicker (right?) at the cost of effectively enforcing no overcommit.

In conclusion, I feel like suspend aligns more closely with the philosophy for which hibernation was conceived. What are your thoughts? I would like to get @abraverm's input as well.

(BTW, it would have been neat to have this discussion, e.g. via an enhancement proposal, before you sunk all this work into the implementation...)
Re build errors: it looks like you need to re-run the vendoring. You'll have to explicitly check in the new vendored files from gophercloud.
Partial review. I only made it through the client.
I think perhaps we should discuss the design (see above) before spending more time on the code review process.
Also, please assign HIVE-2042 to yourself and reference it in this PR's title so it gets cross-linked.
docs/hibernating-clusters.md
Outdated
```bash
$ oc patch cd $cd -n $cd_ns --type='merge' -p $'spec:\n powerState: Hibernating'
$ oc patch cd $cd -n $cd_ns --type='merge' -p $'spec:\n powerState: Running'
```
I think this chunk isn't needed.
- Hibernation is used automatically for unclaimed clusters in ClusterPools whenever size > runningCount.
- Hibernation isn't specific to ClusterPools -- it can be used for standalone ClusterDeployments.
- These patch commands are already listed above (~L35).
pkg/openstackclient/client.go
Outdated
	GetFlavorDetails(ctx context.Context, flavorID string) (*FlavorDetails, error)
}

type Network struct {
Can't we import these typedefs from gophercloud rather than shimming them?
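For illustration, a minimal sketch (the method set and names here are mine, not the PR's) of what the interface could look like if it leaned on the gophercloud types directly:

```go
// Sketch: declare the client interface in terms of gophercloud's own types
// instead of local shim structs. Method names are illustrative.
package openstackclient

import (
	"context"

	"github.com/gophercloud/gophercloud/openstack/compute/v2/flavors"
	"github.com/gophercloud/gophercloud/openstack/compute/v2/servers"
	"github.com/gophercloud/gophercloud/openstack/networking/v2/ports"
)

type Client interface {
	GetServer(ctx context.Context, serverID string) (*servers.Server, error)
	GetFlavor(ctx context.Context, flavorID string) (*flavors.Flavor, error)
	ListPorts(ctx context.Context, opts ports.ListOpts) ([]ports.Port, error)
}
```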
pkg/openstackclient/client.go
Outdated
	networks := []servers.Network{
		{
			UUID: opts.NetworkID,
			Port: opts.PortID,
		},
	}

	// Convert custom options to satisfy GopherCloud
	createOpts := &servers.CreateOpts{
		Name:           opts.Name,
		ImageRef:       opts.ImageRef,
		FlavorRef:      opts.FlavorRef,
		Networks:       networks,
		SecurityGroups: opts.SecurityGroups,
		Metadata:       opts.Metadata,
	}
...like here. What's the advantage of having our own ServerCreateOpts that we simply map to servers.CreateOpts in code, vs letting the caller construct the servers.CreateOpts out of effectively the same data?
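For comparison, a rough sketch (assuming the gophercloud v1 compute API already vendored in this PR; all field values are placeholders) of the alternative where the caller builds servers.CreateOpts itself and the client just forwards it:

```go
// Sketch: callers construct gophercloud's servers.CreateOpts directly rather
// than a local ServerCreateOpts that gets mapped onto it.
package hibernation

import (
	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/compute/v2/servers"
)

func createServerDirect(computeClient *gophercloud.ServiceClient) (*servers.Server, error) {
	createOpts := servers.CreateOpts{
		Name:      "example-worker-0", // placeholder values throughout
		ImageRef:  "IMAGE-UUID",
		FlavorRef: "FLAVOR-UUID",
		Networks:  []servers.Network{{UUID: "NETWORK-UUID", Port: "PORT-UUID"}},
		Metadata:  map[string]string{"purpose": "illustration"},
	}
	return servers.Create(computeClient, createOpts).Extract()
}
```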
pkg/openstackclient/client.go
Outdated
server, err := servers.Get(c.computeClient, serverID).Extract()
return server, err
nit, could just be
return servers.Get(c.computeClient, serverID).Extract()
pkg/openstackclient/client.go
Outdated
image, err := images.Get(c.imageClient, imageID).Extract()
return image, err
nit:
return images.Get(c.imageClient, imageID).Extract()
pkg/openstackclient/client.go
Outdated
	return nil, err
}

portList, err := ports.ExtractPorts(allPages)
Another example: why would we need to repackage the resulting []ports.Port into our own []Port rather than just

return ports.ExtractPorts(allPages)

?
pkg/openstackclient/client.go
Outdated
}

// Get the security group names for a specific server
func (c *openstackClient) GetServerSecurityGroups(ctx context.Context, serverID string) ([]string, error) {
If it is truly desirable for this func to return a list of names rather than a list of security groups, suggest naming it
func (c *openstackClient) GetServerSecurityGroupNames(ctx context.Context, serverID string) ([]string, error) {
instead.
pkg/openstackclient/client.go
Outdated
if creds.TenantID != "" {
	authOpts.TenantID = creds.TenantID
}
if creds.TenantName != "" {
	authOpts.TenantName = creds.TenantName
}
if creds.DomainID != "" {
	authOpts.DomainID = creds.DomainID
}
if creds.DomainName != "" {
	authOpts.DomainName = creds.DomainName
}
if creds.UserDomainID != "" {
	authOpts.DomainID = creds.UserDomainID
}
if creds.ProjectDomainID != "" {
	authOpts.DomainID = creds.ProjectDomainID
}
if creds.ProjectDomainName != "" {
	authOpts.DomainName = creds.ProjectDomainName
}
I think all of these conditionals are redundant. If the creds.* field is the empty string, you're just avoiding setting the corresponding authOpts.* field to the empty string... which is already its default (zero) value.

...but again, this is only relevant if we continue to see the need for a shim struct rather than just letting the caller use the upstream gophercloud.AuthOptions directly.
pkg/openstackclient/client.go
Outdated
// Create a client wrapper object for interacting with OpenStack.
func NewClientFromSecret(secret *corev1.Secret) (Client, error) {
	cloudsYaml, ok := secret.Data["clouds.yaml"]
nit: we have a const for this.
The main purpose of hibernation in on-premise clouds is to increase the number of clusters, as resources are limited and disk is cheaper than memory and CPU. That is, having more clusters in hibernation, ready to be started, saves the time it takes to deploy a new cluster (about ~10 min vs 1 hour). In OpenStack each project (tenant) usually has a resource quota. Suspending an instance (VM) doesn't reduce the resource usage. Shelving could reduce the resource usage, but it requires additional configuration - Quota usage from Placement - which in our case caused problems and we had to revert. This is why we ended up with this design.
Sure thing.
If this is a good heuristic for the timing of the shelving/unshelving process, then I'm fine with it. I was just envisioning trying to shove a couple TB of disk images across a network to a storage device and thinking that might take long enough to kill the value vs spinning up a new cluster. Presumably the speed is dependent on having glance connected to particular types of storage -- more direct-attached-y and less network-y? If so, that seems like something worth calling out in the doc section you're adding here.
I'll defer to your experience on this point, but I'll reiterate that hive seems to have been designed (before my time) such that overcommit capability was not a primary goal. Let me elaborate on the example I mentioned above:
So simplistically let's say you have 1 CPU per cluster and a quota of 5 CPU. You think you can overcommit with the following settings:

    size: 7
    runningCount: 2
    maxSize: 10
    maxConcurrent: 5
Now consider the steady state of a populated pool with no claimed clusters:

I'm not saying you shouldn't try to play this game; just that, even when you're pretty sure you've got the math right, there are non-crazy scenarios that will still blow up, and it's because the controller logic wasn't designed with overcommit in mind.

Interesting side note, this text looks eerily familiar. I was a Placement core while I was working in upstream OpenStack, and it's possible I had a hand in authoring this, or at least reviewing it :)
IIRC the snapshots were actually smaller than expected, that is, not a 100% copy of the disk. @msu8, please add info in the doc about the expected hibernation storage usage for a normal deployment.
We could add some queue for on-prem cloud credentials that would decide whether to deploy a cluster or restore from hibernation, but let's call it future work and avoid over-engineering this feature. @msu8, could you add a warning in the documentation about how Hive would try to restore a cluster from hibernation and how it doesn't solve competing claims and ClusterPools?
Really appreciate your initial look at this @2uasimojo, and thanks for further clarifying some stuff @abraverm. I'm occupied with other projects right now, so I'll try to address the above issues as soon as possible (early next week).
- add new two-phase snapshot cleanup
- add vendor
…nstack_hibernation
Hi, I tried to address the feedback from the first review and shortened the client considerably, as well as (hopefully) fixed the vendor issue.

I was stuck a little bit in overengineering the actuator code - I added a two-phase snapshot deletion to free up as many resources as possible, but I'm not sure whether this is the correct way to do this, so I'm looking forward to more feedback. The issue essentially is that if you create a basic cp (3 masters, 3 workers), the worker snapshots cannot be deleted, only the master instances can. This means that the "old" worker instances will be cleaned up in the next hibernation phase and new snapshots will be created. However I'm not too confident in the correctness of this solution or whether this is even necessary, as it removes the possibility to manually recreate the master instance, if that's even desirable. At the same time, this minimizes the resource use even more (although the snapshots don't take away from the project resources - they need to take away from somewhere).

Also, I did not add more to the documents as I feel there'll be more work and changes on the doc side of things.
Having two snapshots for a short time is not ideal, but there is nothing you can do about it AFAIK. I think you should document this so the user won't be surprised that there is a surge of disk usage on OpenStack during hibernation.
Partial.
	coresUsed = int(usage.TotalVCPUsUsage)
	ramUsed = int(usage.TotalMemoryMBUsage)
}

// Check available resources
We generally don't bother with preflight checks for quota usage. The reason is that many factors can affect quotas asynchronously: other clusters may be provisioning/deprovisioning, hibernating/resuming, etc. -- which may or may not even be happening/visible on this hub. Point being that you're not seriously reducing the probability of hitting a quota failure by doing this check.
I won't veto it if you really want to keep it, but I see it being mostly a waste of code, including the associated maintenance etc.
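For what it's worth, a hedged sketch of the alternative -- just attempt the operation and recognize a quota failure when it happens (this assumes gophercloud v1's error types; Nova generally reports exceeded quotas as a 403):

```go
// Sketch: no preflight quota check; surface a likely quota failure from the
// real call and let the controller retry on a later reconcile.
package hibernation

import (
	"errors"
	"fmt"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/compute/v2/servers"
)

func createServer(client *gophercloud.ServiceClient, opts servers.CreateOpts) (*servers.Server, error) {
	srv, err := servers.Create(client, opts).Extract()
	if err != nil {
		var forbidden gophercloud.ErrDefault403
		if errors.As(err, &forbidden) {
			// Likely quota exhaustion; wrap it so the caller can log/retry.
			return nil, fmt.Errorf("likely quota exhaustion creating server: %w", err)
		}
		return nil, err
	}
	return srv, nil
}
```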
@msu8, I recommend scratching this part considering the impact.
@@ -28,53 +30,24 @@ type Client interface {
	GetServer(ctx context.Context, serverID string) (*servers.Server, error)
At a glance (heh) this interface looks a lot better now, thank you!
[Later] I looked a bit deeper, still wanting to cut out some of the intermediary structs, see below.
	return err
}

// Create snapshots for each instance
// 3.
There doesn't seem to be a 2.
😇
@@ -199,9 +180,10 @@ func (a *openstackActuator) createSnapshots(openstackClient openstackclient.Clie
	for i, server := range servers {
		logger.Infof("creating snapshot %d/%d for instance %s", i+1, len(servers), server.Name)

		snapshotID, err := openstackClient.CreateServerSnapshot(ctx, server.ID, server.Name)
		snapshotName := fmt.Sprintf("%s-hibernation-%s", server.Name, time.Now().UTC().Format("20060102-150405"))
Running time.Now() inside of the loop means we're likely to end up with different suffixes for snapshots for the different servers for the same hibernation event. Is that the desired outcome? I would have thought it would be better for debuggability if they matched.
	flavorCache := make(map[string]*flavors.Flavor) // Changed type

	for _, instance := range instances {
		requirements.Instances++

		// Get flavor details
		var flavor *openstackclient.FlavorDetails
		var flavor *flavors.Flavor // Changed type
Were these Changed type comments supposed to be permanent?
// Create map of current snapshot IDs for exclusion
currentIDs := make(map[string]bool)
for _, id := range currentSnapshotIDs {
	currentIDs[id] = true
}
Looks like you may want to use sets from apimachinery. Lots of examples in the hive project.
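A minimal sketch of that suggestion using apimachinery's string sets (variable names mirror the snippet above):

```go
// Sketch: replace the hand-rolled map[string]bool with sets.String.
package hibernation

import (
	"k8s.io/apimachinery/pkg/util/sets"
)

func filterOutCurrent(existing []string, currentSnapshotIDs []string) []string {
	currentIDs := sets.NewString(currentSnapshotIDs...)
	var stale []string
	for _, id := range existing {
		if !currentIDs.Has(id) {
			stale = append(stale, id)
		}
	}
	return stale
}
```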
// Filter snapshots
for _, snapshot := range existingSnapshots {
	if strings.Contains(snapshot.Name, infraID) && strings.Contains(snapshot.Name, "hibernation") {
Is this matching the names generated here? Is an infraID here the same as a server.Name there?
Perhaps it makes sense to make two (or more) funcs for snapshot names -- one to generate, another to match -- and put them right next to each other to make it easy for future-us to find, read, and correlate them.
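A sketch of that generate/match pair (helper names are hypothetical; the "-hibernation-" convention and timestamp format follow the diff above, with the suffix computed once per pass so all names from the same hibernation event match):

```go
// Hypothetical helpers: generate hibernation snapshot names and, right next
// to it, match them, so future readers can correlate the two.
package hibernation

import (
	"fmt"
	"strings"
	"time"
)

// hibernationSnapshotSuffix returns one shared timestamp suffix per pass.
func hibernationSnapshotSuffix() string {
	return time.Now().UTC().Format("20060102-150405")
}

// hibernationSnapshotName builds the snapshot name for a server.
func hibernationSnapshotName(serverName, suffix string) string {
	return fmt.Sprintf("%s-hibernation-%s", serverName, suffix)
}

// isHibernationSnapshotFor reports whether an image name looks like a
// hibernation snapshot for the given cluster. This assumes server names start
// with the infraID, so the infraID is a prefix of the generated name.
func isHibernationSnapshotFor(imageName, infraID string) bool {
	return strings.HasPrefix(imageName, infraID) && strings.Contains(imageName, "-hibernation-")
}
```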
ctx := context.Background()

// List all images
existingSnapshots, err := openstackClient.ListImages(ctx, nil)
Looks like an opportunity to make use of the ListOpts param to make ListImages() return only the relevant images, avoiding the need to filter below. You could implement a custom ListOptsBuilder for that... but it might be cooler if you tagged the snapshots with the infra ID (and/or a static hibernation string) when created; and then you could use the native Tags filter here.

[Later] Ugh, I just found out that you can't tag a snapshot on creation; you have to do it afterward. That's terrible, and pretty much kills the above idea, because we can't count on not getting interrupted between those two operations. We could declare failure if that happens; or we could make the whole thing idempotent; but then we still have to have logic to look up the images based on name so we can delete or tag orphans.

[Later later] GDI they really don't want to make it easy to prefilter these results, do they? I'm just worried about scaling here. An Image is a biggish struct, and there could potentially be quite a lot of them in a given cloud at a given time, and using no ListOpts and then AllPages() is guaranteed to give you all of them, which could be a nontrivial amount of memory.

I'm out of time for today, but lmk if you have any ideas for how to make this more efficient.
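One possibility, sketched under the assumption that the gophercloud v1 image-service API vendored in this PR is available: iterate page by page instead of calling AllPages(), so only one page of Image structs is held in memory at a time, and drop non-matching names as we go:

```go
// Sketch: bound memory by walking the image list page by page and keeping
// only candidate hibernation snapshots.
package hibernation

import (
	"strings"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/imageservice/v2/images"
	"github.com/gophercloud/gophercloud/pagination"
)

func listHibernationSnapshots(imageClient *gophercloud.ServiceClient, infraID string) ([]images.Image, error) {
	var matches []images.Image
	// Limit bounds the page size; further server-side filters could be added.
	opts := images.ListOpts{Limit: 100}
	err := images.List(imageClient, opts).EachPage(func(page pagination.Page) (bool, error) {
		imgs, err := images.ExtractImages(page)
		if err != nil {
			return false, err
		}
		for _, img := range imgs {
			if strings.HasPrefix(img.Name, infraID) && strings.Contains(img.Name, "-hibernation-") {
				matches = append(matches, img)
			}
		}
		return true, nil // continue to the next page
	})
	return matches, err
}
```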
Partial.
(Sorry, forgot to post this last night; and I've got to pause reviewing this for other priorities today.)
@@ -28,53 +30,24 @@ type Client interface {
	GetServer(ctx context.Context, serverID string) (*servers.Server, error)
[Later] I looked a bit deeper, still wanting to cut out some of the intermediary structs, see below.
pkg/openstackclient/client.go
Outdated
type Credentials struct {
	AuthURL            string `json:"auth_url"`
	Username           string `json:"username,omitempty"`
	Password           string `json:"password,omitempty"`
	UserID             string `json:"user_id,omitempty"`
	ProjectID          string `json:"project_id,omitempty"`
	ProjectName        string `json:"project_name,omitempty"`
	UserDomainName     string `json:"user_domain_name,omitempty"`
	UserDomainID       string `json:"user_domain_id,omitempty"`
	ProjectDomainName  string `json:"project_domain_name,omitempty"`
	ProjectDomainID    string `json:"project_domain_id,omitempty"`
	RegionName         string `json:"region_name,omitempty"`
	Interface          string `json:"interface,omitempty"`
	IdentityAPIVersion string `json:"identity_api_version,omitempty"`

	TenantID   string `json:"tenant_id,omitempty"`
	TenantName string `json:"tenant_name,omitempty"`
	DomainID   string `json:"domain_id,omitempty"`
	DomainName string `json:"domain_name,omitempty"`
	Region     string `json:"region,omitempty"`
}

type CloudsYAML struct {
	Clouds map[string]CloudConfig `yaml:"clouds"`
}

type CloudConfig struct {
	Auth      CloudAuth `yaml:"auth"`
	Region    string    `yaml:"region_name"`
	Interface string    `yaml:"interface"`
	Version   string    `yaml:"identity_api_version"`
}

type CloudAuth struct {
	AuthURL           string `yaml:"auth_url"`
	Username          string `yaml:"username"`
	Password          string `yaml:"password"`
	ProjectID         string `yaml:"project_id"`
	ProjectName       string `yaml:"project_name"`
	UserDomainName    string `yaml:"user_domain_name"`
	ProjectDomainName string `yaml:"project_domain_name"`
	UserDomainID      string `yaml:"user_domain_id"`
	ProjectDomainID   string `yaml:"project_domain_id"`
}
Is there a reason we can't get all of this from upstream?
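For reference, a sketch of what leaning on upstream could look like via gophercloud/utils' clientconfig package (not currently vendored here; loading clouds.yaml from a Secret's bytes rather than a file would still need a custom YAMLOptsBuilder, so this is an illustration, not a drop-in):

```go
// Illustration only: resolve a compute client from a named cloud in
// clouds.yaml via gophercloud/utils, instead of hand-rolled credential structs.
package hibernation

import (
	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/utils/openstack/clientconfig"
)

func computeClientForCloud(cloudName string) (*gophercloud.ServiceClient, error) {
	opts := &clientconfig.ClientOpts{Cloud: cloudName}
	// clientconfig reads clouds.yaml from the standard locations (or a custom
	// YAMLOptsBuilder) and returns a ready-to-use service client.
	return clientconfig.NewServiceClient("compute", opts)
}
```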
	networkClient *gophercloud.ServiceClient
	credentials   *Credentials
}
Good practice to include a type validation statement like

var _ Client = &openstackClient{}

This will go red if something changes such that openstackClient doesn't satisfy the Client interface.
func (a *openstackActuator) findInstancesByPrefix(openstackClient openstackclient.Client, prefix string) ([]ServerInfo, error) {
	ctx := context.Background()

	servers, err := openstackClient.ListServers(ctx, nil)
Okay, so here I think we probably can make this more efficient by taking advantage of opts. Worth checking with the installer team, but usually they tag cloud resources associated with an OpenShift cluster (to make it easier to find them when deprovisioning). For example, AWS resources for an OpenShift cluster are all tagged with kubernetes.io/cluster/$infraID=owned.

And, nit, consider spelling it findInstancesByInfraID(..., infraID string).
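As a sketch of the opts idea (assuming the gophercloud v1 compute API; I believe Nova treats the name filter as a regular expression, so anchoring it to the infraID pushes the filtering server-side; the Tags filter would be cleaner still if the installer tags the instances):

```go
// Sketch: server-side filtering of instances by infraID prefix via ListOpts.
package hibernation

import (
	"fmt"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/compute/v2/servers"
)

func findInstancesByInfraID(computeClient *gophercloud.ServiceClient, infraID string) ([]servers.Server, error) {
	// Anchor the regex so only servers whose names start with infraID match.
	opts := servers.ListOpts{Name: fmt.Sprintf("^%s", infraID)}
	allPages, err := servers.List(computeClient, opts).AllPages()
	if err != nil {
		return nil, err
	}
	return servers.ExtractServers(allPages)
}
```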
// basic server information
type ServerInfo struct {
	ID   string
	Name string
}
I think I get why you did this -- Server itself is kind of big to be lugging around -- but since it seems unavoidable that we're going to be slurping the whole struct across the wire anyway, the only difference it makes is when it gets garbage collected. The duration of one reconcile loop should be short enough that we're not gaining anything by releasing it "early". As long as you're passing around *Server, you'll avoid making copies, and then IMO it makes sense to get rid of this type.

...unless I missed somewhere we're saving this struct to persistent storage?
// Build configuration for each instance
var instanceConfigs []OpenStackInstanceConfig
for i, serverInfo := range matchingServers {
	serverDetails, err := openstackClient.GetServer(ctx, serverInfo.ID)
Pursuant to comment, if we were passing around *Server instead of our reduced ServerInfo, this call would be unnecessary.
}

// Get security groups
secGroups, err := openstackClient.GetServerSecurityGroupNames(ctx, serverInfo.ID)
Do these security groups persist when the instances are deleted? Can they still be looked up by server ID at that point? My intent: Either we need to recreate the SGs as part of the resume process; or as noted above we would be able to do this lookup on the resume side instead of here.
type openstackActuator struct {
	openstackClientFn func(*hivev1.ClusterDeployment, client.Client, log.FieldLogger) (openstackclient.Client, error)
}
Consider a type assertion
var _ HibernationActuator = &openstackActuator{}
// openstackActuator implements HibernationActuator for OpenStack
type openstackActuator struct {
	openstackClientFn func(*hivev1.ClusterDeployment, client.Client, log.FieldLogger) (openstackclient.Client, error)
Just an idea: rather than passing the client around as a param to every helper func, we could cache it in the actuator struct. It would go something like:
diff --git a/pkg/controller/hibernation/openstack_actuator.go b/pkg/controller/hibernation/openstack_actuator.go
index 50fbeeb7d..623d6efba 100644
--- a/pkg/controller/hibernation/openstack_actuator.go
+++ b/pkg/controller/hibernation/openstack_actuator.go
@@ -28,6 +28,16 @@ func init() {
// openstackActuator implements HibernationActuator for OpenStack
type openstackActuator struct {
openstackClientFn func(*hivev1.ClusterDeployment, client.Client, log.FieldLogger) (openstackclient.Client, error)
+ openstackClient openstackclient.Client
+}
+
+func (a *openstackActuator) cacheClient(cd *hivev1.ClusterDeployment, c client.Client, logger log.FieldLogger) error {
+ if a.openstackClient != nil {
+ return nil
+ }
+ var err error
+ a.openstackClient, err = a.openstackClientFn(cd, c, logger)
+ return err
}
// Create API client
@@ -63,7 +73,7 @@ func (a *openstackActuator) StopMachines(cd *hivev1.ClusterDeployment, hiveClien
logger = logger.WithField("cloud", "openstack")
logger.Info("stopping machines and creating snapshots")
- openstackClient, err := a.openstackClientFn(cd, hiveClient, logger)
+ err := a.cacheClient(cd, hiveClient, logger)
if err != nil {
return fmt.Errorf("failed to create OpenStack client: %v", err)
}
@@ -71,7 +81,7 @@ func (a *openstackActuator) StopMachines(cd *hivev1.ClusterDeployment, hiveClien
infraID := cd.Spec.ClusterMetadata.InfraID
// 1.
- matchingServers, err := a.findInstancesByPrefix(openstackClient, infraID)
+ matchingServers, err := a.findInstancesByPrefix(infraID)
if err != nil {
return fmt.Errorf("error finding instances: %v", err)
}
@@ -684,10 +694,10 @@ type OpenStackInstanceConfig struct {
}
// Return servers that match the infraID prefix
-func (a *openstackActuator) findInstancesByPrefix(openstackClient openstackclient.Client, prefix string) ([]ServerInfo, error) {
+func (a *openstackActuator) findInstancesByPrefix(prefix string) ([]ServerInfo, error) {
ctx := context.Background()
- servers, err := openstackClient.ListServers(ctx, nil)
+ servers, err := a.openstackClient.ListServers(ctx, nil)
if err != nil {
return nil, fmt.Errorf("error listing servers: %v", err)
}
...and similar changes in the other helper funcs.
The tradeoff is tighter coupling: the helpers rely on that cache func having been called, or they'll panic. I'm not sure if it's worth the characters saved by eliminating that extra param 🤷
(I would have liked for all the helpers to call something more like a.getClient().DoWhatever(...), and have getClient() do the nil check and cache the client if necessary... but then there's no good way to handle the error from the client creation :( )
logger.Infof("found %d instances to hibernate", len(matchingServers))

if err := a.validateInstanceStates(openstackClient, matchingServers, logger); err != nil {
Again here, I'm not sure it helps us to prevalidate. Since things can happen asynchronously, we would still need to have an error path for when we try to do the real thing and it fails because an instance is "invalid".
Also for idempotence, if we are interrupted partway through, is it possible to hit the next iteration and encounter one or more instances legitimately deleting, because we already snapshotted them last time? [Later] Hm, not as the algorithm is currently written... but I don't think that flow is idempotent...
This is sending me down a mental rabbit hole. Is it possible to somehow "freeze" or "lock" an instance so that it can't be touched by another process while we're working? If so, how would we be able to touch it in the aforementioned second-run-through scenario?
This question is actually relevant in terms of losing work as well. Scenario: we create a snapshot while some process is running on the instance. The process continues after our snapshot finishes, but before we delete the instance. When we restore, we'll go back to that old state, having lost whatever progress was made in the interim. Depending on the workload, that may or may not be recoverable.
Can we Pause() the instances as the first step?
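For reference, a sketch of what pausing could look like with gophercloud v1's pauseunpause compute extension (assuming it's available in the vendored tree); whether pausing before snapshotting is actually the right call is exactly the open question above:

```go
// Sketch: pause each instance before snapshotting so no work is lost between
// the snapshot and the instance deletion.
package hibernation

import (
	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/compute/v2/extensions/pauseunpause"
)

func pauseInstances(computeClient *gophercloud.ServiceClient, serverIDs []string) error {
	for _, id := range serverIDs {
		if err := pauseunpause.Pause(computeClient, id).ExtractErr(); err != nil {
			return err
		}
	}
	return nil
}
```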
…shot naming, pausing other fixes to logging, structure, comments
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Hi, thanks a lot for the amazing feedback and sorry that the following commit took so much time - some things were a bit difficult for me and I unfortunately had to focus on other work. I think I addressed all of the feedback except for a couple of points that I think aren't as necessary or would add friction/slowdown - I'll ask additional questions Monday and will elaborate further. Sort of time pressed.
@msu8: The following tests failed; say /retest to rerun all failed tests.

Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Adds hibernation functionality for OpenStack clusters.
Hibernation is based on restoring instances from snapshots, freeing up all of the project's resources that were taken by the instances. Replicating true hibernation in OpenStack is difficult to impossible for us - e.g. shelving exists in OpenStack, however it does not necessarily release the project's resources, as that depends on how OpenStack is configured. This solution is universally compatible.
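As a rough illustration of the approach at the API level (a sketch assuming the gophercloud v1 compute API; the helper name is mine, not the PR's actual code): each instance is snapshotted to a Glance image and then deleted, which is what releases the project's quota; resume later recreates the instances from those images.

```go
// Illustrative sketch of the snapshot-then-delete half of hibernation.
package hibernation

import (
	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/compute/v2/servers"
)

func snapshotAndDelete(computeClient *gophercloud.ServiceClient, serverID, snapshotName string) (string, error) {
	// Create a Glance image from the instance.
	imageID, err := servers.CreateImage(computeClient, serverID, servers.CreateImageOpts{Name: snapshotName}).ExtractImageID()
	if err != nil {
		return "", err
	}
	// (In practice you'd wait for the image to become active first.)
	// Deleting the instance is what actually frees the project's quota.
	if err := servers.Delete(computeClient, serverID).ExtractErr(); err != nil {
		return imageID, err
	}
	return imageID, nil
}
```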
After a ClusterPool is created and claimed:

Hibernation:
$ oc patch cd $cd -n $cd_ns --type='merge' -p $'spec:\n powerState: Hibernating'

Restoration:
$ oc patch cd $cd -n $cd_ns --type='merge' -p $'spec:\n powerState: Running'
This approach still has many quirks, especially the time it takes to save the snapshots and restore the instances, with the instances going through different states - the code is set up to work around those by frequently updating and checking the actual state of the instances.
Looking forward to feedback and more suggestions.