
watching stops without any notification #273

Open
grosser opened this issue Nov 2, 2017 · 22 comments
Labels: bug, watches (About watching k8s objects)

Comments

grosser (Contributor) commented Nov 2, 2017

I've already seen multiple times that the watcher just stops ... without crashing or notifying ...
I don't know how to reproduce it, but it happens regularly ... and it does not happen for kube-proxy, so either there is a bug in this library or the kube-proxy Go code has some smart disconnect handling.

At the moment I'm using the class below and calling .restart every x minutes:

class KubernetesWatcher
  def initialize(kuber_client, namespace)
    @kuber_client = kuber_client
    @namespace = namespace
  end

  # Watch endpoints forever: when a watch ends (disconnect or #restart),
  # the loop immediately opens a new one.
  def watch(&block)
    loop do
      @watcher = @kuber_client.watch_endpoints(namespace: @namespace)
      @watcher.each(&block)
    end
  end

  # Finish the current watch so #watch reconnects.
  def restart
    @watcher.finish
  end
end

I don't know how to fix/debug this further, but wanted to raise awareness.
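
For illustration, a rough sketch (not from the issue) of how the class above could be driven: a background thread calls .restart periodically so the loop in #watch reconnects even if the server silently drops the connection. The API URL, namespace and 5-minute interval are placeholders.

require 'kubeclient'

client = Kubeclient::Client.new('https://localhost:8443/api/', 'v1') # adjust URL/auth for your cluster
watcher = KubernetesWatcher.new(client, 'default')

# Periodically finish the current watch; #watch's loop reconnects immediately.
Thread.new do
  loop do
    sleep 300
    watcher.restart
  end
end

watcher.watch do |notice|
  puts "#{notice.type} #{notice.object.metadata.name}"
end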

simon3z (Collaborator) commented Nov 2, 2017

cc @cben @agrare @moolitayer

simon3z added the bug label Nov 2, 2017
cben (Collaborator) commented Nov 2, 2017

@grosser when you say "just stops", do you mean .each { block... } returns at some point?
Or does it keep running but never call the block?

Which Kubernetes version are you connecting to? And which Ruby are you running?

Don't know if it's related, but I ran oc get pod --watch yesterday and noticed it just exited after some time.

grosser (Contributor, Author) commented Nov 2, 2017 via email

grosser (Contributor, Author) commented Nov 3, 2017

Confirmed that the block just stops ... so having a reconnect, by default or as an option, would be nice ... at the moment I'm just doing loop do ... watch ... end (see the sketch below).
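
For reference, the bare reconnect loop described here amounts to something like the following; client and handle are placeholders, not part of kubeclient.

loop do
  client.watch_endpoints(namespace: 'default').each do |notice|
    handle(notice) # your event handling
  end
  # reaching this point means the watch ended (timeout, disconnect, ...); loop and reconnect
end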

simon3z removed the bug label Nov 3, 2017
simon3z (Collaborator) commented Nov 3, 2017

Actually I am not sure we want to add this in the lib. How would you know that there was a disconnection? (Maybe some message was lost, etc.). I would keep this as is.

grosser (Contributor, Author) commented Nov 3, 2017 via email

simon3z (Collaborator) commented Nov 6, 2017

> then we should call that out in the readme ... "watcher will stop when it is disconnected, run it in a loop to make sure it stays running" or so

Definitely 👍

grosser added commits to grosser/kubeclient that referenced this issue Nov 9, 2017
Eric-Fontana-Bose commented Nov 15, 2017

I wrote a service which watches namespaces, for the purpose of emitting a continuous stream of events so a person can see (the next day) when his pod restarted and for what reason. My complaint here is that yes, my watch loop is an infinite loop, but when the loop starts again you end up getting the same data you previously emitted, because the watches overlap. This stinks. To work around it, I'm generating a hash and dropping the "already read" events. Anyone have a better idea?
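
One way to implement the hash-based de-duplication described above is sketched below; the choice of UID plus resourceVersion as the key is an assumption, and in a long-running service the seen set would need pruning.

require 'set'

seen = Set.new

loop do
  client.watch_events(namespace: 'my-namespace').each do |notice|
    # key the notice on object identity + version so replays after a reconnect are dropped
    key = [notice.object.metadata.uid, notice.object.metadata.resourceVersion]
    next if seen.include?(key)
    seen << key
    emit(notice) # hypothetical downstream handler
  end
end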

grosser (Contributor, Author) commented Nov 15, 2017

  • ignore the first events after a restart ... not perfect, but maybe acceptable if missing an event is not that critical:

  loop do
    ignore_until = Time.now.to_f + 0.5 # re-processing happens in the first 0.4s
    kuber_client.watch_endpoints.each do |e|
      next if Time.now.to_f < ignore_until
      puts e
    end
  end

  • store the timestamp of the last received event and reject everything before it on restart (see the sketch after this list)
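
A sketch of the second option, assuming Event objects and their lastTimestamp field (other kinds would need a different timestamp field, and this also drops distinct events that share a timestamp):

require 'time'

last_seen = Time.at(0)

loop do
  kuber_client.watch_events(namespace: 'my-namespace').each do |notice|
    ts = Time.parse(notice.object.lastTimestamp || notice.object.metadata.creationTimestamp)
    next if ts <= last_seen # replayed from before the restart
    last_seen = ts
    puts notice
  end
end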

moolitayer (Collaborator) commented Nov 16, 2017

I've seen this happening with curl after a little over an hour on kubernetes v1.7.6+a08f5eeb62:

$ time curl -k -s -H "Authorization: Bearer $OSH_TOKEN"     https://$OSH_HOST:8443/api/v1/pods?watch=139024
...
...
...
real	71m49.189s
user	0m0.336s
sys	0m0.332s

This suggests this is caused by a k8s issue.

moolitayer (Collaborator):

@agrare @Ladas FYI the pod watch in ManageIQ will probably encounter this issue and restart the collection worker (which sounds asymptomatic)

agrare (Member) commented Nov 16, 2017

Thanks @moolitayer, we're okay because if the thread exits it will be restarted and currently we don't process the initial list of pods.

This is a little concerning when we move to using watches as the primary mechanism for refresh because then we will be processing the entire initial set every hour. Basically the same scheduled full refresh we do today 😟

Stono commented Oct 16, 2018

Hey,
So I just wanted to add my latest finding here, following a k8s upgrade.

The logs are below, but long story short: the latest resourceVersion is 44924022; however, if you use that as your starting point, k8s returns 410 GONE (because these particular resources haven't been updated in quite some time).

The only way to get a watcher started then is to use 0, which returns you ALL the objects.

You'll then need to filter the returned objects to be >= 44924022.

It's quite shit really, as you're potentially returning a lot of objects from the k8s API, especially when the connection times out so frequently (seemingly every 90 seconds or so for CRDs in particular).

I, [2018-10-16T22:01:26.119618 #98517]  INFO -- : [App::ApiWatcher#watch_from] /apis/atcloud.io/v1/services watch will start from offset: 44924022
E, [2018-10-16T22:01:26.882520 #98517] ERROR -- : [App::ApiWatcher#block in watch_from] /apis/atcloud.io/v1/services getting 410 GONE responses for the latest offset 44924022, will restart from 0 which is the next known offset.  Some events may have been missed!
I, [2018-10-16T22:01:26.882659 #98517]  INFO -- : [App::ApiWatcher#watch_from] /apis/atcloud.io/v1/services watch will start from offset: 0
I, [2018-10-16T22:01:27.160224 #98517]  INFO -- : [Handlers::Slack#create_apis_atcloud_io_v1_services] shippr-simple
I, [2018-10-16T22:01:27.763382 #98517]  INFO -- : [SlackAPI#parse] 200 OK: ok
I, [2018-10-16T22:01:27.763739 #98517]  INFO -- : [Handlers::Slack#create_apis_atcloud_io_v1_services] platform-testing
I, [2018-10-16T22:01:28.021200 #98517]  INFO -- : [SlackAPI#parse] 200 OK: ok
W, [2018-10-16T22:03:00.473269 #98517]  WARN -- : [App::Runner#block in setup_watch_thread] /apis/atcloud.io/v1/services stopped, will restart from 44891278
I, [2018-10-16T22:03:00.473339 #98517]  INFO -- : [App::ApiWatcher#watch_from] /apis/atcloud.io/v1/services watch will start from offset: 44891278
E, [2018-10-16T22:03:00.916934 #98517] ERROR -- : [App::ApiWatcher#block in watch_from] /apis/atcloud.io/v1/services getting 410 GONE responses for the latest offset 44891278, will restart from 44924022 which is the next known offset.  Some events may have been missed!
I, [2018-10-16T22:03:00.917068 #98517]  INFO -- : [App::ApiWatcher#watch_from] /apis/atcloud.io/v1/services watch will start from offset: 44924022
E, [2018-10-16T22:03:01.411459 #98517] ERROR -- : [App::ApiWatcher#block in watch_from] /apis/atcloud.io/v1/services getting 410 GONE responses for the latest offset 44924022, will restart from 0 which is the next known offset.  Some events may have been missed!
I, [2018-10-16T22:03:01.411611 #98517]  INFO -- : [App::ApiWatcher#watch_from] /apis/atcloud.io/v1/services watch will start from offset: 0
I, [2018-10-16T22:03:01.641530 #98517]  INFO -- : [Handlers::Slack#create_apis_atcloud_io_v1_services] platform-testing
I, [2018-10-16T22:03:01.978240 #98517]  INFO -- : [SlackAPI#parse] 200 OK: ok
I, [2018-10-16T22:03:01.978561 #98517]  INFO -- : [Handlers::Slack#create_apis_atcloud_io_v1_services] shippr-simple
I, [2018-10-16T22:03:02.520127 #98517]  INFO -- : [SlackAPI#parse] 200 OK: ok
W, [2018-10-16T22:04:30.996193 #98517]  WARN -- : [App::Runner#block in setup_watch_thread] /apis/atcloud.io/v1/services stopped, will restart from 44924022
I, [2018-10-16T22:04:30.996243 #98517]  INFO -- : [App::ApiWatcher#watch_from] /apis/atcloud.io/v1/services watch will start from offset: 44924022
E, [2018-10-16T22:04:31.654225 #98517] ERROR -- : [App::ApiWatcher#block in watch_from] /apis/atcloud.io/v1/services getting 410 GONE responses for the latest offset 44924022, will restart from 0 which is the next known offset.  Some events may have been missed!
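
A rough sketch of the restart-with-fallback behaviour described in this comment, using core v1 Services as a stand-in for the CRD in the logs; handle is a placeholder, and whether resource_version is passed through depends on your kubeclient version. Note cben's caveat further down that resourceVersion is meant to be treated as opaque, so the ">= 44924022" style filtering is not guaranteed to keep working.

version = '0' # '0' (or omitted) replays existing objects; see the API-docs quote further down

loop do
  client.watch_services(resource_version: version).each do |notice|
    if notice.type == 'ERROR' && notice.object&.code == 410
      version = '0' # our version is too old; the next iteration replays and must be de-duplicated
    else
      version = notice.object.metadata.resourceVersion
      handle(notice)
    end
  end
  # the stream ended (timeout or 410); the outer loop reconnects from `version`
end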

Stono commented Oct 16, 2018

Does anyone know if resourceVersion is unique to the kind (e.g. a v1/Secret), or is it incrementing across all objects?

Stono commented Oct 16, 2018

Reading kubernetes/kubernetes#55230, it looks like you can specify timeoutSeconds on the watch API call?
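
timeoutSeconds is a standard query parameter on Kubernetes list/watch calls: the API server ends the watch after roughly that many seconds, so the client reconnects on its own schedule rather than being cut off unexpectedly. A raw-HTTP sketch below; host, token and namespace are placeholders, and kubeclient itself may or may not pass the parameter through, depending on the version.

require 'net/http'
require 'openssl'
require 'json'
require 'uri'

uri = URI('https://kubernetes.example:6443/api/v1/namespaces/default/endpoints?watch=1&timeoutSeconds=300')

Net::HTTP.start(uri.host, uri.port, use_ssl: true, verify_mode: OpenSSL::SSL::VERIFY_NONE) do |http|
  request = Net::HTTP::Get.new(uri)
  request['Authorization'] = "Bearer #{ENV['K8S_TOKEN']}"
  http.request(request) do |response|
    # each line of the chunked response is one watch notice (for brevity this
    # assumes chunks contain whole lines; a real client buffers partial lines)
    response.read_body do |chunk|
      chunk.each_line { |line| puts JSON.parse(line)['type'] rescue nil }
    end
  end
end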

Stono commented Oct 17, 2018

Does anyone know if the resourceVersion ever rolls over? At the moment quite a bit of my de-dupe code is basically resourceVersion > lastStoredResourceVersion.

If that value doesn't just continually increment, then eventually my app will break.

cben (Collaborator) commented Oct 18, 2018

The answers in this discussion explain some things well: kubernetes/website#6540 (comment)

> Does anyone know if resourceVersion is unique to the kind (e.g. a v1/Secret), or is it incrementing across all objects?

By default it is shared across many collections, but it might be separate; for example, Events have an independent version by default. Do not assume any relation between versions of different collections.

> resourceVersion > lastStoredResourceVersion
>
> If that value doesn't just continually increment, then eventually my app will break.

So this presently works AFAIK, but k8s devs are very insistent that versions are opaque strings, which might not even be integers in the future, and clients shouldn't compare them for order, only equality :-(

> The only way to get a watcher started then is to use 0, which returns you ALL the objects.

I think (?) you can just start a watch without specifying resourceVersion, and then you'll only get changes after now?

You can also do a list request (get_foos) and take that version. That also loads all existing objects, but at least you get them separately from future changes.
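
A sketch of the list-then-watch approach: take the collection's resourceVersion from the list response and start the watch from there, so the watch only delivers changes that happen after the list. Method and attribute names are as in recent kubeclient versions; verify against the one you run.

pods = client.get_pods(namespace: 'default')  # the full list; process or ignore it once
version = pods.resourceVersion                # collection version at list time

client.watch_pods(namespace: 'default', resource_version: version).each do |notice|
  puts "#{notice.type} #{notice.object.metadata.name}"
end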

cben (Collaborator) commented Oct 18, 2018

> I think (?) you can just start a watch without specifying resourceVersion, and then you'll only get changes after now?

Nah, my bad, that would be too convenient to be true 😉 The API docs say:
https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.12/#watch-list-pod-v1-core

> resourceVersion — When specified with a watch call, shows changes that occur after that particular version of a resource. Defaults to changes from the beginning of history. When specified for list: - if unset, then the result is returned from remote storage based on quorum-read flag; - if it's 0, then we simply return what we currently have in cache, no guarantee; - if set to non zero, then the result is at least as fresh as given rv.

So there is a slight difference between 0 and omitted, but both give you old data...

grosser (Contributor, Author) commented Dec 20, 2019

FYI, the code I'm currently using to get cheap restarts and no replays on restart:

# frozen_string_literal: true

module Watching
  def initialize(*)
    super
    @started = latest_resource_version
  end

  private

  def watch_resource(type)
    loop do
      get_json(type, watch: true, resource_version: @started) do |notice|
        if notice[:type] == "ERROR" && notice.dig(:object, :code) == 410 # version was too old, watch will stop
          @started = latest_resource_version
        else
          @started = notice.dig(:object, :metadata, :resourceVersion)
          yield notice
        end
      end
    end
  end

  # get something cheap to get latest resourceVersion from the List object that is returned
  def latest_resource_version
    get_json(:namespaces, limit: 1, raw: true).dig(:metadata, :resourceVersion)
  end
end
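
A hypothetical host class for the module above, to show how it is meant to be mixed in; get_json is assumed to be the including class's own wrapper around the Kubernetes API (it is not part of kubeclient) that yields parsed watch notices and returns parsed list objects.

class PodWatcher
  include Watching

  def run
    watch_resource(:pods) do |notice|
      puts "#{notice[:type]} #{notice.dig(:object, :metadata, :name)}"
    end
  end

  private

  # app-specific HTTP + JSON plumbing, left out of this sketch
  def get_json(type, watch: false, resource_version: nil, limit: nil, raw: false)
    raise NotImplementedError
  end
end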

@mecampbellsoup:

> I've already seen multiple times that the watcher just stops ... without crashing or notifying ...
> I don't know how to reproduce it, but it happens regularly ... and it does not happen for kube-proxy, so either there is a bug in this library or the kube-proxy Go code has some smart disconnect handling.
>
> At the moment I'm using the class below and calling .restart every x minutes: [KubernetesWatcher code from the original post]
>
> I don't know how to fix/debug this further, but wanted to raise awareness.

Is there no way to accomplish this "persistent" connection without using a loop? 😞

aantn commented Mar 19, 2021

> FYI, the code I'm currently using to get cheap restarts and no replays on restart: [Watching module code from grosser's comment above]

@grosser isn't there a race condition here when the latest resourceVersion changes in between and therefore you miss updates?

grosser (Contributor, Author) commented Mar 28, 2021

Yes, but that would happen very rarely (when things break down).
It could be improved by comparing new versions to the previous started version.
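
One possible reading of that refinement (an interpretation, not necessarily what grosser had in mind): when the 410 fallback jumps @started forward, compare the fresh version with the one the watch was running from and surface the gap, instead of skipping it silently.

module Watching
  def watch_resource(type)
    loop do
      get_json(type, watch: true, resource_version: @started) do |notice|
        if notice[:type] == "ERROR" && notice.dig(:object, :code) == 410
          previous = @started
          @started = latest_resource_version
          # events with versions between `previous` and the fresh version were never seen
          warn "#{type}: watch restarted at #{@started}, events after #{previous} may have been missed"
        else
          @started = notice.dig(:object, :metadata, :resourceVersion)
          yield notice
        end
      end
    end
  end
end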
