This repository has been archived by the owner on Dec 11, 2023. It is now read-only.

delete guest cluster nodes from Kubernetes API #484

Merged 31 commits into master from the nodes branch on Jun 8, 2018
Conversation

@xh3b4sd (Contributor) commented May 30, 2018

@xh3b4sd xh3b4sd self-assigned this May 30, 2018
@xh3b4sd xh3b4sd deployed to gorgoth June 7, 2018 14:01 Active
@xh3b4sd xh3b4sd deployed to gorgoth June 7, 2018 14:44 Active
@xh3b4sd xh3b4sd deployed to gorgoth June 7, 2018 15:15 Active
@xh3b4sd xh3b4sd deployed to gorgoth June 7, 2018 15:39 Active
@xh3b4sd xh3b4sd deployed to gorgoth June 7, 2018 16:03 Active
@xh3b4sd xh3b4sd deployed to gorgoth June 7, 2018 16:21 Active
@xh3b4sd xh3b4sd deployed to gorgoth June 7, 2018 16:48 Active
@xh3b4sd xh3b4sd deployed to gorgoth June 7, 2018 17:11 Active
@xh3b4sd (Contributor, Author) left a comment:

This was successfully tested on gorgoth.

@@ -93,6 +103,10 @@
name = "k8s.io/client-go"
version = "kubernetes-1.9.3"

[[constraint]]
branch = "release-1.9"
name = "k8s.io/kubernetes"
@xh3b4sd (Contributor, Author):

I had to pin this. I found no other way to get the dependencies right. This may improve again with the move to 1.10.4.

K8sClient kubernetes.Interface
K8sExtClient apiextensionsclient.Interface
Logger micrologger.Logger
CertsSearcher certs.Interface
@xh3b4sd (Contributor, Author):

There is some refactoring because of the injection of the certs searcher, which different controllers need.
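
For illustration, a minimal sketch of how such a config could look with the certs searcher injected next to the usual dependencies; the struct layout and validation are assumptions for this sketch, not the actual kvm-operator code:

```go
package deleter

import (
	"github.com/giantswarm/certs"
	"github.com/giantswarm/microerror"
	"github.com/giantswarm/micrologger"
	"k8s.io/client-go/kubernetes"
)

var invalidConfigError = microerror.New("invalid config")

// DeleterConfig is a hypothetical controller config showing the injected
// certs searcher alongside the usual dependencies.
type DeleterConfig struct {
	CertsSearcher certs.Interface
	K8sClient     kubernetes.Interface
	Logger        micrologger.Logger
}

// Deleter is a placeholder for the controller wired up from the config.
type Deleter struct{}

// NewDeleter validates that every injected dependency is set before the
// controller is constructed.
func NewDeleter(config DeleterConfig) (*Deleter, error) {
	if config.CertsSearcher == nil {
		return nil, microerror.Maskf(invalidConfigError, "config.CertsSearcher must not be empty")
	}
	if config.K8sClient == nil {
		return nil, microerror.Maskf(invalidConfigError, "config.K8sClient must not be empty")
	}
	if config.Logger == nil {
		return nil, microerror.Maskf(invalidConfigError, "config.Logger must not be empty")
	}

	return &Deleter{}, nil
}
```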

if err != nil {
return nil, microerror.Mask(err)
}
}
@xh3b4sd (Contributor, Author):

When the latest resource package was introduced it was not properly wired. I am doing that here to get going.

*controller.Controller
}

func NewDeleter(config DeleterConfig) (*Deleter, error) {
@xh3b4sd (Contributor, Author):

This is the new deleter controller, which takes over the job of the node controller that we remove now.

Watcher: config.G8sClient.ProviderV1alpha1().KVMConfigs(""),

RateWait: informer.DefaultRateWait,
ResyncPeriod: 30 * time.Second,
@xh3b4sd (Contributor, Author):

We check for nodes to be deleted every 30 seconds.

"github.com/giantswarm/kvm-operator/service/controller/v13/key"
)

func (r *Resource) EnsureCreated(ctx context.Context, obj interface{}) error {
@xh3b4sd (Contributor, Author):

This is where the magic happens.

// Fetch the list of pods running on the host cluster. These pods serve VMs
// which in turn run the guest cluster nodes. We use the pods to compare them
// against the guest cluster nodes below.
var pods []corev1.Pod
@xh3b4sd (Contributor, Author):

The old implementation was built around a cloud provider interface that effectively just listed pods. A lot of boilerplate, complexity, and indirection is gone now. We simply compare pods with nodes, which is better.
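
For illustration, a minimal sketch of that comparison, using the client-go signatures of the kubernetes-1.9.3 pin above (no context arguments) and a hypothetical name-matching rule between nodes and their backing pods; the real resource additionally looks at the node's ready condition and the pod phase as discussed further down:

```go
package deleter

import (
	"github.com/giantswarm/microerror"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ensureOrphanedNodesDeleted lists the pods that back the guest cluster VMs
// on the host cluster, lists the nodes inside the guest cluster, and deletes
// every guest node that has no backing pod anymore.
func ensureOrphanedNodesDeleted(hostClient, guestClient kubernetes.Interface, namespace string) error {
	pods, err := hostClient.CoreV1().Pods(namespace).List(metav1.ListOptions{})
	if err != nil {
		return microerror.Mask(err)
	}

	nodes, err := guestClient.CoreV1().Nodes().List(metav1.ListOptions{})
	if err != nil {
		return microerror.Mask(err)
	}

	for _, n := range nodes.Items {
		if doesNodeExistAsPod(pods.Items, n) {
			continue
		}
		// The backing pod is gone, so this node will never come back. Remove
		// it from the guest cluster's Kubernetes API.
		err := guestClient.CoreV1().Nodes().Delete(n.Name, &metav1.DeleteOptions{})
		if err != nil {
			return microerror.Mask(err)
		}
	}

	return nil
}

// doesNodeExistAsPod assumes that a guest node carries the same name as the
// host cluster pod serving its VM, which is a simplification for this sketch.
func doesNodeExistAsPod(pods []corev1.Pod, n corev1.Node) bool {
	for _, p := range pods {
		if p.Name == n.Name {
			return true
		}
	}

	return false
}
```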

var guestAPINotAvailableError = microerror.New("guest API not available")

// IsGuestAPINotAvailable asserts guestAPINotAvailableError.
func IsGuestAPINotAvailable(err error) bool {
@xh3b4sd (Contributor, Author):

I took this from cluster-operator and extended it with the EOF checks. We should separate this out and use the same mechanism everywhere. I can imagine the EOF errors are an issue in other places as well.
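
For illustration, a hedged sketch of what such an error matcher with EOF handling could look like; the exact set of checks is an assumption for this sketch, not the cluster-operator code:

```go
package deleter

import (
	"io"
	"strings"

	"github.com/giantswarm/microerror"
)

var guestAPINotAvailableError = microerror.New("guest API not available")

// IsGuestAPINotAvailable asserts guestAPINotAvailableError and additionally
// treats plain EOF errors as "guest API not available", since a guest API
// that is still booting or tearing down often surfaces as EOF on the wire.
func IsGuestAPINotAvailable(err error) bool {
	if err == nil {
		return false
	}

	c := microerror.Cause(err)

	if c == guestAPINotAvailableError {
		return true
	}
	if c == io.EOF || c == io.ErrUnexpectedEOF {
		return true
	}
	if strings.Contains(c.Error(), "EOF") {
		return true
	}

	return false
}
```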

@xh3b4sd (Contributor, Author):

@fgimenez @rossf7 WDYT? Where to put it? giantswarm/errors?

Contributor:

@xh3b4sd Yes, we need this logic in multiple places, including helmclient. I've already created the repo but got distracted by other things. I'll pick this up next.

https://github.com/giantswarm/errors

// it and assume we are done.
// informer's watch event is outdated and the pod got already deleted in
// the Kubernetes API. This is a normal transition behaviour, so we just
// ignore it and assume we are done.
@xh3b4sd (Contributor, Author):

Boyscouting.

@@ -12,8 +15,6 @@ import (
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/rest"

"github.com/giantswarm/apiextensions/pkg/clientset/versioned"
@xh3b4sd (Contributor, Author):

Boyscouting.

@xh3b4sd xh3b4sd requested review from a team June 7, 2018 17:39
@r7vme (Contributor) left a comment:

LGTM,

One thing that still needs to be covered is the complex network issue described here:

giantswarm/kvm-operator-node-controller#5

I fixed it by adding a healthz endpoint and killing the node-controller pod if there were errors connecting to the guest API. The problem is that k8sclient multiplexes multiple HTTP connections through one TCP connection, and when the TCP connection is dead (but still in the established state) there is no way to force the use of a new TCP connection.

See these issues in client-go:
kubernetes/client-go#342
kubernetes/client-go#374

As a workaround, is it possible to kill the goroutine or reinitialize a completely new instance of the k8sclient when we detect errors?

return nil
}

func doesNodeExistAsPod(pods []corev1.Pod, n corev1.Node) bool {
Contributor:

Not sure how often we need this, but there is also a second use case where we want to clean up a node: when the pod is not in the Running phase. This covers situations where the pod crashed or is stuck in terminating for some reason.

https://github.com/giantswarm/kvm-operator-node-controller/blob/master/provider/instances.go#L70
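
For illustration, a sketch of the combined check under discussion, using a hypothetical helper name and the same node/pod name-matching assumption as above; it treats a node as removable when its backing pod is gone or not in the Running phase:

```go
package deleter

import (
	corev1 "k8s.io/api/core/v1"
)

// nodeNeedsCleanup reports whether a guest node should be deleted: either no
// backing pod exists anymore, or the backing pod exists but is not Running,
// which covers pods that crashed or are stuck terminating.
func nodeNeedsCleanup(pods []corev1.Pod, n corev1.Node) bool {
	for _, p := range pods {
		if p.Name != n.Name {
			continue
		}
		// A backing pod exists; clean up only if it is not running anymore.
		return p.Status.Phase != corev1.PodRunning
	}

	// No backing pod found at all.
	return true
}
```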

@xh3b4sd (Contributor, Author):

Is there a test case for this? I made sure both test cases you mentioned in the original issue are covered. Anyway, this support is super easy to add, not here but above in the loop where we already check for the ready condition. Thanks for the hint, I will add this.

@xh3b4sd (Contributor, Author):

Done.

@tuommaki (Contributor) left a comment:

Quite a big diff. In general it looks good. One note about possibly missing resource cancellation.

if IsGuestAPINotAvailable(err) {
r.logger.LogCtx(ctx, "level", "debug", "message", "guest cluster is not available")
r.logger.LogCtx(ctx, "level", "debug", "message", "canceling resource for custom object")

Contributor:

Is resource canceling missing here, or should the above log message be changed?

@xh3b4sd (Contributor, Author):

Canceling works with the default resource by simply returning. Context cancelation has to be used with CRUD resources, and this here is not a CRUD resource. See also https://github.com/giantswarm/operatorkit/blob/master/docs/control_flow_primitives.md#default-resources.
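
For illustration, a sketch of how that pattern looks in a default resource; ensureNodesDeleted is a hypothetical helper standing in for the actual deletion logic, and IsGuestAPINotAvailable is the matcher from the errors discussion above:

```go
package deleter

import (
	"context"

	"github.com/giantswarm/microerror"
	"github.com/giantswarm/micrologger"
)

// Resource mirrors the resource from the diff above; only the logger matters
// for this sketch.
type Resource struct {
	logger micrologger.Logger
}

// ensureNodesDeleted stands in for the actual node deletion logic.
func (r *Resource) ensureNodesDeleted(ctx context.Context, obj interface{}) error {
	return nil
}

// EnsureCreated cancels the reconciliation of this default (non-CRUD)
// resource by simply returning when the guest API is not reachable. No
// context based cancelation machinery is needed for that.
func (r *Resource) EnsureCreated(ctx context.Context, obj interface{}) error {
	err := r.ensureNodesDeleted(ctx, obj)
	if IsGuestAPINotAvailable(err) {
		r.logger.LogCtx(ctx, "level", "debug", "message", "guest cluster is not available")
		r.logger.LogCtx(ctx, "level", "debug", "message", "canceling resource for custom object")

		return nil
	} else if err != nil {
		return microerror.Mask(err)
	}

	return nil
}
```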

@xh3b4sd (Contributor, Author) commented Jun 8, 2018:

> One thing that still needs to be covered is the complex network issue described here:
>
> giantswarm/kvm-operator-node-controller#5
>
> I fixed it by adding a healthz endpoint and killing the node-controller pod if there were errors connecting to the guest API. The problem is that k8sclient multiplexes multiple HTTP connections through one TCP connection, and when the TCP connection is dead (but still in the established state) there is no way to force the use of a new TCP connection.
>
> See these issues in client-go:
> kubernetes/client-go#342
> kubernetes/client-go#374
>
> As a workaround, is it possible to kill the goroutine or reinitialize a completely new instance of the k8sclient when we detect errors?

I remember the case but couldn't see any issue in this direction while testing yesterday. What would be the test case for this so I could maybe play it through? Also note that the design here is slightly different: the k8s client for the guest clusters is created from scratch all the time. So far I couldn't find a problem. We also work in different goroutines all the time, but I am not sure how this would affect connection handling. Back then I was quite baffled by the problem we saw and I didn't have an explanation for it. Maybe the garbage collection changed or is applied differently now.
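
For illustration, a sketch of the "created from scratch" approach mentioned above: building the clientset anew on every reconciliation means a dead but still-established TCP connection from an earlier loop cannot be reused. Parameter names are illustrative, not the actual kvm-operator code:

```go
package deleter

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// newGuestClient builds a brand new clientset for the guest cluster API from
// the cluster's endpoint and TLS material on every call.
func newGuestClient(endpoint string, ca, crt, key []byte) (kubernetes.Interface, error) {
	c := &rest.Config{
		Host: endpoint,
		TLSClientConfig: rest.TLSClientConfig{
			CAData:   ca,
			CertData: crt,
			KeyData:  key,
		},
	}

	return kubernetes.NewForConfig(c)
}
```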

@xh3b4sd xh3b4sd deployed to gorgoth June 8, 2018 09:44 Active
@xh3b4sd (Contributor, Author) commented Jun 8, 2018:

I tested this successfully on gorgoth again. I also played through the update from 2.9.0 to 2.10.0 and that looked really good. The node controller deployment was also removed automatically.

@xh3b4sd xh3b4sd merged commit 3f0fc7c into master Jun 8, 2018
@xh3b4sd xh3b4sd deleted the nodes branch June 8, 2018 11:32
@r7vme (Contributor) commented Jun 8, 2018:

> The k8s client for the guest clusters is created from scratch all the time.

If so, then it should be fine. The problem was that the client was initialized only once in the main routine.

@calvix (Contributor) commented Jun 11, 2018:

> I remember the case but couldn't see any issue in this direction while testing yesterday. What would be the test case for this so I could maybe play it through?

AFAIK the test case was killing the master pod and then checking whether the controller is still working and cleaning up nodes afterwards.
