delete guest cluster nodes from Kubernetes API #484
Conversation
# Conflicts:
#	Gopkg.lock
This got successfully tested on gorgoth.
@@ -93,6 +103,10 @@
  name = "k8s.io/client-go"
  version = "kubernetes-1.9.3"

[[constraint]]
  branch = "release-1.9"
  name = "k8s.io/kubernetes"
I had to pin this. There was no other way I found to get the stupid deps right. This may improve again with the move to 1.10.4.
    K8sClient     kubernetes.Interface
    K8sExtClient  apiextensionsclient.Interface
    Logger        micrologger.Logger
    CertsSearcher certs.Interface
There is some refactoring because of the injection of the certs searcher, since different controllers need it.
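For illustration, a minimal sketch of the wiring this enables, assuming both controller configs carry the same searcher (the cluster config name is a hypothetical placeholder; the field names follow the snippet above):

```go
// Sketch only: one certs searcher instance is created at service level and
// injected into every controller config that needs guest cluster certificates,
// instead of each controller building its own. ClusterConfig is a hypothetical
// name used for illustration; DeleterConfig appears in this PR.
deleterConfig := DeleterConfig{
	CertsSearcher: certsSearcher,
	G8sClient:     g8sClient,
	K8sClient:     k8sClient,
	Logger:        logger,
}

clusterConfig := ClusterConfig{
	CertsSearcher: certsSearcher,
	G8sClient:     g8sClient,
	K8sClient:     k8sClient,
	Logger:        logger,
}
```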
    if err != nil {
        return nil, microerror.Mask(err)
    }
}
When the latest resource package was introduced it was not properly wired. Doing this here to get going.
    *controller.Controller
}

func NewDeleter(config DeleterConfig) (*Deleter, error) {
This is the new deleter controller, which does the job of the node controller we are removing now.
        Watcher: config.G8sClient.ProviderV1alpha1().KVMConfigs(""),

        RateWait:     informer.DefaultRateWait,
        ResyncPeriod: 30 * time.Second,
We check for nodes to be deleted every 30 seconds.
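As a rough sketch of how that maps to the informer shown in the diff above (assuming the usual operatorkit informer constructor; error handling abbreviated):

```go
// Sketch: the informer re-delivers every watched KVMConfig at least once per
// resync period, so a 30 second ResyncPeriod means each guest cluster is
// checked for orphaned nodes roughly every 30 seconds, even without any
// watch events arriving.
newInformer, err := informer.New(informer.Config{
	Watcher: config.G8sClient.ProviderV1alpha1().KVMConfigs(""),

	RateWait:     informer.DefaultRateWait,
	ResyncPeriod: 30 * time.Second,
})
if err != nil {
	return nil, microerror.Mask(err)
}
```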
"github.com/giantswarm/kvm-operator/service/controller/v13/key" | ||
) | ||
|
||
func (r *Resource) EnsureCreated(ctx context.Context, obj interface{}) error { |
This is where the magic happens.
    // Fetch the list of pods running on the host cluster. These pods serve VMs
    // which in turn run the guest cluster nodes. We use the pods to compare them
    // against the guest cluster nodes below.
    var pods []corev1.Pod
The old implementation was built around a cloud provider interface, which effectively just listed pods. A lot of boilerplate, complexity and indirection is gone now. We simply compare pods with nodes, which is better.
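To make the idea concrete, the comparison boils down to something like this (a simplified sketch, not the exact diff; `guestK8sClient` and `isNodeReady` are placeholder names, `doesNodeExistAsPod` is shown further down):

```go
// Sketch: list the guest cluster nodes and delete every node that is not
// ready and no longer backed by a pod on the host cluster. Signatures follow
// the pinned client-go kubernetes-1.9.3 API (no context arguments yet).
nodeList, err := guestK8sClient.CoreV1().Nodes().List(metav1.ListOptions{})
if err != nil {
	return microerror.Mask(err)
}

for _, node := range nodeList.Items {
	if isNodeReady(node) {
		// The node is healthy, nothing to clean up.
		continue
	}
	if doesNodeExistAsPod(pods, node) {
		// The backing pod still exists on the host cluster, so the node may
		// come back. Leave it alone.
		continue
	}

	err := guestK8sClient.CoreV1().Nodes().Delete(node.Name, &metav1.DeleteOptions{})
	if err != nil {
		return microerror.Mask(err)
	}
}
```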
var guestAPINotAvailableError = microerror.New("guest API not available")

// IsGuestAPINotAvailable asserts guestAPINotAvailableError.
func IsGuestAPINotAvailable(err error) bool {
I took this from cluster-operator and extended it with the EOF checks. We should separate this and use the same mechanism everywhere. I can imagine the EOF errors are a thing in other places as well.
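Roughly, the matcher looks along these lines (a sketch of the idea only, the actual implementation in this PR may cover more cases; it needs the `io` and `strings` imports):

```go
// Sketch: treat EOF-style and connection errors from the guest API the same
// way as an explicit guestAPINotAvailableError so callers can back off
// gracefully instead of failing the whole reconciliation.
func IsGuestAPINotAvailable(err error) bool {
	if err == nil {
		return false
	}

	c := microerror.Cause(err)

	if c == io.EOF || c == io.ErrUnexpectedEOF {
		return true
	}
	if strings.Contains(c.Error(), "EOF") {
		return true
	}
	if strings.Contains(c.Error(), "connection refused") {
		return true
	}

	return c == guestAPINotAvailableError
}
```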
@xh3b4sd Yes we need this logic in multiple places including helmclient. I've already created the repo but I got distracted by other things. I'll pick this up next.
    // informer's watch event is outdated and the pod got already deleted in
    // the Kubernetes API. This is a normal transition behaviour, so we just
    // ignore it and assume we are done.
Boyscouting.
@@ -12,8 +15,6 @@ import (
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"

    "github.com/giantswarm/apiextensions/pkg/clientset/versioned"
Boyscouting.
LGTM,
one thing that is still to be covered is the complex network issue described here:
giantswarm/kvm-operator-node-controller#5
I fixed it by adding a healthz endpoint and killing the node-controller pod if there were errors connecting to the guest API. The problem is that k8sclient multiplexes multiple HTTP connections through one TCP connection, and when the TCP connection is dead (but still in the established state), there is no way to force the client to use a new TCP connection.
See issues in client-go
kubernetes/client-go#342
kubernetes/client-go#374
As a workaround, is it possible to kill the goroutine or reinitialize a completely new instance of k8sclient if we detect errors?
    return nil
}

func doesNodeExistAsPod(pods []corev1.Pod, n corev1.Node) bool {
Not sure how often we need this, but there is also a second use case where we want to clean up a node: when the pod is not in the Running phase. This covers situations where the pod crashed or is stuck in terminating for some reason.
https://github.com/giantswarm/kvm-operator-node-controller/blob/master/provider/instances.go#L70
Is there a test case for this? I made sure both test cases you mentioned in the original issue are covered. Anyway, this support is super easy to add, not here, but above in the loop where we already check for the ready condition. Thanks for the hint. I will add this.
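A sketch of what that additional check could look like, next to the existing ready condition check (matching pods to nodes by name is an assumption here, and `podRunningForNode` is a hypothetical helper name):

```go
// Sketch: a node is also considered orphaned when its backing pod exists but
// is not in the Running phase, e.g. crashed or stuck in terminating.
func podRunningForNode(pods []corev1.Pod, n corev1.Node) bool {
	for _, p := range pods {
		if p.Name == n.Name && p.Status.Phase == corev1.PodRunning {
			return true
		}
	}

	return false
}
```

Used in place of the plain existence check, this would cover both the missing pod and the non-Running pod cases.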
Done.
Quite a big diff. In general looks good. One note about possibly missing resource cancellation.
    if IsGuestAPINotAvailable(err) {
        r.logger.LogCtx(ctx, "level", "debug", "message", "guest cluster is not available")
        r.logger.LogCtx(ctx, "level", "debug", "message", "canceling resource for custom object")
Is resource cancelling missing here or should the above log message be changed?
Canceling works with the default resource by simply returning. The context cancelation has to be used with CRUD resources, and this is no CRUD resource. See also https://github.com/giantswarm/operatorkit/blob/master/docs/control_flow_primitives.md#default-resources.
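So for this default resource the cancellation is literally just an early return after the log lines above, along these lines (sketch):

```go
// Sketch: with a default (non-CRUD) resource there is no CRUD context to
// cancel; logging and returning nil ends the resource for this reconciliation.
if IsGuestAPINotAvailable(err) {
	r.logger.LogCtx(ctx, "level", "debug", "message", "guest cluster is not available")
	r.logger.LogCtx(ctx, "level", "debug", "message", "canceling resource for custom object")

	return nil
}
```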
I remember the case but couldn't see any issue in this direction while testing yesterday. What would be the test case for this, so I could maybe play it through? Also note that the design here is slightly different: the k8s client for the guest clusters is created from scratch all the time. So far I couldn't find a problem. We also work in different goroutines all the time, but I am not sure how this would affect connection handling. Back then I was quite baffled about the problem we saw and I didn't have an explanation for it. Maybe the garbage collection changed or is applied differently now.
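To illustrate the design difference, a rough sketch with illustrative names (the real code obtains the endpoint and TLS material through the certs searcher and the KVMConfig):

```go
// Sketch: instead of one long-lived client created in the main routine, a
// fresh guest cluster client is built on every reconciliation, so a dead but
// still established TCP connection from a previous loop cannot be reused.
restConfig := &rest.Config{
	Host: guestAPIEndpoint,
	TLSClientConfig: rest.TLSClientConfig{
		CAData:   apiSecrets.CA,
		CertData: apiSecrets.Crt,
		KeyData:  apiSecrets.Key,
	},
}

guestK8sClient, err := kubernetes.NewForConfig(restConfig)
if err != nil {
	return microerror.Mask(err)
}
```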
I tested this successfully on
If so then it should be fine. The problem was that the client was initialized only once in the main routine.
AFAIK the test case was killing the master pod and then checking whether the controller is still working and cleaning up nodes afterwards.
Towards giantswarm/kvm-operator-node-controller#2.