
watching stops without any notification #273

Open
grosser opened this issue Nov 2, 2017 · 22 comments
Labels: bug, watches (About watching k8s objects)

Comments

grosser (Contributor) commented Nov 2, 2017

I've already seen multiple times that the watcher just stops ... without crashing or notifying ...
I don't know how to reproduce it, but it happens regularly ... and it does not happen for kube-proxy, so either there is a bug in this library or the kube-proxy Go code has some smart disconnect handling.

At the moment I'm using the class below and calling .restart every x minutes:

class KubernetesWatcher
  def initialize(kuber_client, namespace)
    @kuber_client = kuber_client
    @namespace = namespace
  end

  # Watch endpoints forever: when a watch ends (disconnect or #restart),
  # the loop immediately opens a new one.
  def watch(&block)
    loop do
      @watcher = @kuber_client.watch_endpoints(namespace: @namespace)
      @watcher.each(&block)
    end
  end

  # Finish the current watch so #watch reconnects.
  def restart
    @watcher.finish
  end
end

I don't know how to fix/debug this further, but wanted to raise awareness.
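
For illustration, a rough sketch (not from the issue) of how the class above could be driven: a background thread calls .restart periodically so the loop in #watch reconnects even if the server silently drops the connection. The API URL, namespace and 5-minute interval are placeholders.

require 'kubeclient'

client = Kubeclient::Client.new('https://localhost:8443/api/', 'v1') # adjust URL/auth for your cluster
watcher = KubernetesWatcher.new(client, 'default')

# Periodically finish the current watch; #watch's loop reconnects immediately.
Thread.new do
  loop do
    sleep 300
    watcher.restart
  end
end

watcher.watch do |notice|
  puts "#{notice.type} #{notice.object.metadata.name}"
end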

simon3z (Collaborator) commented Nov 2, 2017

cc @cben @agrare @moolitayer

simon3z added the bug label Nov 2, 2017
cben (Collaborator) commented Nov 2, 2017

@grosser when you say "just stops", do you mean .each { block... } returns at some point?
Or does it keep running but never call the block?

Which Kubernetes version are you connecting to? And which Ruby are you running?

Don't know if it's related, but I ran oc get pod --watch yesterday and noticed it just exited after some time.

grosser (Contributor, Author) commented Nov 2, 2017 via email

grosser (Contributor, Author) commented Nov 3, 2017

Confirmed that the block just stops ... so having a reconnect, by default or as an option, would be nice ... at the moment I'm just doing loop do ... watch ... end (see the sketch below).
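
For reference, the bare reconnect loop described here amounts to something like the following; client and handle are placeholders, not part of kubeclient.

loop do
  client.watch_endpoints(namespace: 'default').each do |notice|
    handle(notice) # your event handling
  end
  # reaching this point means the watch ended (timeout, disconnect, ...); loop and reconnect
end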

simon3z removed the bug label Nov 3, 2017
simon3z (Collaborator) commented Nov 3, 2017

Actually I am not sure we want to add this in the lib. How would you know that there was a disconnection? (Maybe some message was lost, etc.). I would keep this as is.

grosser (Contributor, Author) commented Nov 3, 2017 via email

simon3z (Collaborator) commented Nov 6, 2017

> then we should call that out in the readme ... "watcher will stop when it is disconnected, run it in a loop to make sure it stays running" or so

Definitely 👍

grosser added commits to grosser/kubeclient that referenced this issue Nov 9, 2017
Eric-Fontana-Bose commented Nov 15, 2017

I wrote a service which watches namespaces, for the purpose of emitting a continuous stream of events so a person can see (the next day) when his pod restarted and for what reason. My complaint here is that yes, my watch loop is an infinite loop, but when the loop starts again you end up getting the same data you previously emitted, because the watches overlap. This stinks. To work around it, I'm generating a hash and dropping the "already read" events. Anyone have a better idea?
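
One way to implement the hash-based de-duplication described above is sketched below; the choice of UID plus resourceVersion as the key is an assumption, and in a long-running service the seen set would need pruning.

require 'set'

seen = Set.new

loop do
  client.watch_events(namespace: 'my-namespace').each do |notice|
    # key the notice on object identity + version so replays after a reconnect are dropped
    key = [notice.object.metadata.uid, notice.object.metadata.resourceVersion]
    next if seen.include?(key)
    seen << key
    emit(notice) # hypothetical downstream handler
  end
end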

grosser (Contributor, Author) commented Nov 15, 2017

  • ignore the first events after a restart ... not perfect, but maybe acceptable if missing an event is not that critical:

  loop do
    ignore_until = Time.now.to_f + 0.5 # re-processing happens in the first 0.4s
    kuber_client.watch_endpoints.each do |e|
      next if Time.now.to_f < ignore_until
      puts e
    end
  end

  • store the timestamp of the last received event and reject everything before it on restart (see the sketch after this list)
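
A sketch of the second option, assuming Event objects and their lastTimestamp field (other kinds would need a different timestamp field, and this also drops distinct events that share a timestamp):

require 'time'

last_seen = Time.at(0)

loop do
  kuber_client.watch_events(namespace: 'my-namespace').each do |notice|
    ts = Time.parse(notice.object.lastTimestamp || notice.object.metadata.creationTimestamp)
    next if ts <= last_seen # replayed from before the restart
    last_seen = ts
    puts notice
  end
end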

moolitayer (Collaborator) commented Nov 16, 2017

I've seen this happening with curl after a little over an hour on kubernetes v1.7.6+a08f5eeb62:

$ time curl -k -s -H "Authorization: Bearer $OSH_TOKEN"     https://$OSH_HOST:8443/api/v1/pods?watch=139024
...
...
...
real	71m49.189s
user	0m0.336s
sys	0m0.332s

This suggests this is caused by a k8s issue.

moolitayer (Collaborator):

@agrare @Ladas FYI the pod watch in ManageIQ will probably encounter this issue and restart the collection worker (which sounds asymptomatic)

agrare (Member) commented Nov 16, 2017

Thanks @moolitayer, we're okay because if the thread exits it will be restarted and currently we don't process the initial list of pods.

This is a little concerning when we move to using watches as the primary mechanism for refresh because then we will be processing the entire initial set every hour. Basically the same scheduled full refresh we do today 😟

Stono commented Oct 16, 2018

Hey,
So I just wanted to add my latest finding here, following a k8s upgrade.

The logs are below, but long story short: the latest resourceVersion is 44924022; however, if you use that as your starting point, k8s returns 410 GONE (because these particular resources haven't been updated in quite some time).

The only way to get a watcher started then is to use 0, which returns you ALL the objects.

You'll then need to filter the returned objects to be >= 44924022.

It's quite shit really, as you're potentially returning a lot of objects from the k8s API, especially when the connection times out so frequently (seemingly every 90 seconds or so for CRDs in particular).

I, [2018-10-16T22:01:26.119618 #98517]  INFO -- : [App::ApiWatcher#watch_from] /apis/atcloud.io/v1/services watch will start from offset: 44924022
E, [2018-10-16T22:01:26.882520 #98517] ERROR -- : [App::ApiWatcher#block in watch_from] /apis/atcloud.io/v1/services getting 410 GONE responses for the latest offset 44924022, will restart from 0 which is the next known offset.  Some events may have been missed!
I, [2018-10-16T22:01:26.882659 #98517]  INFO -- : [App::ApiWatcher#watch_from] /apis/atcloud.io/v1/services watch will start from offset: 0
I, [2018-10-16T22:01:27.160224 #98517]  INFO -- : [Handlers::Slack#create_apis_atcloud_io_v1_services] shippr-simple
I, [2018-10-16T22:01:27.763382 #98517]  INFO -- : [SlackAPI#parse] 200 OK: ok
I, [2018-10-16T22:01:27.763739 #98517]  INFO -- : [Handlers::Slack#create_apis_atcloud_io_v1_services] platform-testing
I, [2018-10-16T22:01:28.021200 #98517]  INFO -- : [SlackAPI#parse] 200 OK: ok
W, [2018-10-16T22:03:00.473269 #98517]  WARN -- : [App::Runner#block in setup_watch_thread] /apis/atcloud.io/v1/services stopped, will restart from 44891278
I, [2018-10-16T22:03:00.473339 #98517]  INFO -- : [App::ApiWatcher#watch_from] /apis/atcloud.io/v1/services watch will start from offset: 44891278
E, [2018-10-16T22:03:00.916934 #98517] ERROR -- : [App::ApiWatcher#block in watch_from] /apis/atcloud.io/v1/services getting 410 GONE responses for the latest offset 44891278, will restart from 44924022 which is the next known offset.  Some events may have been missed!
I, [2018-10-16T22:03:00.917068 #98517]  INFO -- : [App::ApiWatcher#watch_from] /apis/atcloud.io/v1/services watch will start from offset: 44924022
E, [2018-10-16T22:03:01.411459 #98517] ERROR -- : [App::ApiWatcher#block in watch_from] /apis/atcloud.io/v1/services getting 410 GONE responses for the latest offset 44924022, will restart from 0 which is the next known offset.  Some events may have been missed!
I, [2018-10-16T22:03:01.411611 #98517]  INFO -- : [App::ApiWatcher#watch_from] /apis/atcloud.io/v1/services watch will start from offset: 0
I, [2018-10-16T22:03:01.641530 #98517]  INFO -- : [Handlers::Slack#create_apis_atcloud_io_v1_services] platform-testing
I, [2018-10-16T22:03:01.978240 #98517]  INFO -- : [SlackAPI#parse] 200 OK: ok
I, [2018-10-16T22:03:01.978561 #98517]  INFO -- : [Handlers::Slack#create_apis_atcloud_io_v1_services] shippr-simple
I, [2018-10-16T22:03:02.520127 #98517]  INFO -- : [SlackAPI#parse] 200 OK: ok
W, [2018-10-16T22:04:30.996193 #98517]  WARN -- : [App::Runner#block in setup_watch_thread] /apis/atcloud.io/v1/services stopped, will restart from 44924022
I, [2018-10-16T22:04:30.996243 #98517]  INFO -- : [App::ApiWatcher#watch_from] /apis/atcloud.io/v1/services watch will start from offset: 44924022
E, [2018-10-16T22:04:31.654225 #98517] ERROR -- : [App::ApiWatcher#block in watch_from] /apis/atcloud.io/v1/services getting 410 GONE responses for the latest offset 44924022, will restart from 0 which is the next known offset.  Some events may have been missed!
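
A rough sketch of the restart-with-fallback behaviour described in this comment, using core v1 Services as a stand-in for the CRD in the logs; handle is a placeholder, and whether resource_version is passed through depends on your kubeclient version. Note cben's caveat further down that resourceVersion is meant to be treated as opaque, so the ">= 44924022" style filtering is not guaranteed to keep working.

version = '0' # '0' (or omitted) replays existing objects; see the API-docs quote further down

loop do
  client.watch_services(resource_version: version).each do |notice|
    if notice.type == 'ERROR' && notice.object&.code == 410
      version = '0' # our version is too old; the next iteration replays and must be de-duplicated
    else
      version = notice.object.metadata.resourceVersion
      handle(notice)
    end
  end
  # the stream ended (timeout or 410); the outer loop reconnects from `version`
end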

Stono commented Oct 16, 2018

Does anyone know if resourceVersion is unique to the kind (e.g. a v1/Secret), or is it incrementing across all objects?

Stono commented Oct 16, 2018

Reading kubernetes/kubernetes#55230, it looks like you can specify timeoutSeconds on the watch API call?
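
timeoutSeconds is a standard query parameter on Kubernetes list/watch calls: the API server ends the watch after roughly that many seconds, so the client reconnects on its own schedule rather than being cut off unexpectedly. A raw-HTTP sketch below; host, token and namespace are placeholders, and kubeclient itself may or may not pass the parameter through, depending on the version.

require 'net/http'
require 'openssl'
require 'json'
require 'uri'

uri = URI('https://kubernetes.example:6443/api/v1/namespaces/default/endpoints?watch=1&timeoutSeconds=300')

Net::HTTP.start(uri.host, uri.port, use_ssl: true, verify_mode: OpenSSL::SSL::VERIFY_NONE) do |http|
  request = Net::HTTP::Get.new(uri)
  request['Authorization'] = "Bearer #{ENV['K8S_TOKEN']}"
  http.request(request) do |response|
    # each line of the chunked response is one watch notice (for brevity this
    # assumes chunks contain whole lines; a real client buffers partial lines)
    response.read_body do |chunk|
      chunk.each_line { |line| puts JSON.parse(line)['type'] rescue nil }
    end
  end
end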

Stono commented Oct 17, 2018

Does anyone know if the resourceVersion ever rolls over? At the moment quite a bit of my de-dupe code is basically resourceVersion > lastStoredResourceVersion.

If that value doesn't just continually increment, then eventually my app will break.

cben (Collaborator) commented Oct 18, 2018

The answers in this discussion explain some things well: kubernetes/website#6540 (comment)

> Does anyone know if resourceVersion is unique to the kind (e.g. a v1/Secret), or is it incrementing across all objects?

By default it is shared across many collections, but it might be separate; for example, Events have an independent version by default. Do not assume any relation between versions of different collections.

> resourceVersion > lastStoredResourceVersion
>
> If that value doesn't just continually increment, then eventually my app will break.

So this presently works AFAIK, but k8s devs are very insistent that versions are opaque strings, which might not even be integers in the future, and clients shouldn't compare them for order, only equality :-(

> The only way to get a watcher started then is to use 0, which returns you ALL the objects.

I think (?) you can just start a watch without specifying resourceVersion, and then you'll only get changes after now?

You can also do a list request (get_foos) and take that version. That also loads all existing objects, but at least you get them separately from future changes.
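
A sketch of the list-then-watch approach: take the collection's resourceVersion from the list response and start the watch from there, so the watch only delivers changes that happen after the list. Method and attribute names are as in recent kubeclient versions; verify against the one you run.

pods = client.get_pods(namespace: 'default')  # the full list; process or ignore it once
version = pods.resourceVersion                # collection version at list time

client.watch_pods(namespace: 'default', resource_version: version).each do |notice|
  puts "#{notice.type} #{notice.object.metadata.name}"
end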

cben (Collaborator) commented Oct 18, 2018

> I think (?) you can just start a watch without specifying resourceVersion, and then you'll only get changes after now?

Nah, my bad, that would be too convenient to be true 😉 The API docs say:
https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.12/#watch-list-pod-v1-core

> resourceVersion — When specified with a watch call, shows changes that occur after that particular version of a resource. Defaults to changes from the beginning of history. When specified for list: - if unset, then the result is returned from remote storage based on quorum-read flag; - if it's 0, then we simply return what we currently have in cache, no guarantee; - if set to non zero, then the result is at least as fresh as given rv.

So there is a slight difference between 0 and omitted, but both give you old data...

grosser (Contributor, Author) commented Dec 20, 2019

FYI, the code I'm currently using to get cheap restarts and no replays on restart:

# frozen_string_literal: true

module Watching
  def initialize(*)
    super
    @started = latest_resource_version
  end

  private

  def watch_resource(type)
    loop do
      get_json(type, watch: true, resource_version: @started) do |notice|
        if notice[:type] == "ERROR" && notice.dig(:object, :code) == 410 # version was too old, watch will stop
          @started = latest_resource_version
        else
          @started = notice.dig(:object, :metadata, :resourceVersion)
          yield notice
        end
      end
    end
  end

  # get something cheap to get latest resourceVersion from the List object that is returned
  def latest_resource_version
    get_json(:namespaces, limit: 1, raw: true).dig(:metadata, :resourceVersion)
  end
end
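
A hypothetical host class for the module above, to show how it is meant to be mixed in; get_json is assumed to be the including class's own wrapper around the Kubernetes API (it is not part of kubeclient) that yields parsed watch notices and returns parsed list objects.

class PodWatcher
  include Watching

  def run
    watch_resource(:pods) do |notice|
      puts "#{notice[:type]} #{notice.dig(:object, :metadata, :name)}"
    end
  end

  private

  # app-specific HTTP + JSON plumbing, left out of this sketch
  def get_json(type, watch: false, resource_version: nil, limit: nil, raw: false)
    raise NotImplementedError
  end
end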

@mecampbellsoup:

> I've already seen multiple times that the watcher just stops ... without crashing or notifying ...
> I don't know how to reproduce it, but it happens regularly ... and it does not happen for kube-proxy, so either there is a bug in this library or the kube-proxy Go code has some smart disconnect handling.
>
> At the moment I'm using the class below and calling .restart every x minutes: [KubernetesWatcher code from the original post]
>
> I don't know how to fix/debug this further, but wanted to raise awareness.

Is there no way to accomplish this "persistent" connection without using a loop? 😞

aantn commented Mar 19, 2021

> FYI, the code I'm currently using to get cheap restarts and no replays on restart: [Watching module code from grosser's comment above]

@grosser isn't there a race condition here when the latest resourceVersion changes in between and therefore you miss updates?

grosser (Contributor, Author) commented Mar 28, 2021

Yes, but that would happen very rarely (when things break down).
It could be improved by comparing new versions to the previous started version.
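
One possible reading of that refinement (an interpretation, not necessarily what grosser had in mind): when the 410 fallback jumps @started forward, compare the fresh version with the one the watch was running from and surface the gap, instead of skipping it silently.

module Watching
  def watch_resource(type)
    loop do
      get_json(type, watch: true, resource_version: @started) do |notice|
        if notice[:type] == "ERROR" && notice.dig(:object, :code) == 410
          previous = @started
          @started = latest_resource_version
          # events with versions between `previous` and the fresh version were never seen
          warn "#{type}: watch restarted at #{@started}, events after #{previous} may have been missed"
        else
          @started = notice.dig(:object, :metadata, :resourceVersion)
          yield notice
        end
      end
    end
  end
end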
