When `logs-drain` is left running long enough it begins to leak threads. Working with the customer and on our own test instance we came across ruby-concurrency/concurrent-ruby#639. My guess was that this had to do with the RabbitMQ reconnection taking longer than the timer task's execution and timeout intervals. Adjusting along those lines did seem to help initially, but we still had a problem. For now we have hot-patched the issue in 2.2.1 by restarting the service every 8 hours. Heroku does a courtesy restart every 24 hours, which is probably why we never saw this issue in hosted; busy servers also don't seem to hit it, which is why I originally thought it was a RabbitMQ heartbeat timeout issue (we used to have a thread leak around that too, fixed in #171).
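For context, here's a minimal sketch of the kind of `TimerTask` usage involved, with illustrative interval values and a hypothetical `drain_batch_from_rabbitmq` helper (this is not the actual travis-logs drain code):

```ruby
require 'concurrent'

# Illustrative intervals only; the real drain uses its own configuration.
# If the block runs longer than these intervals (e.g. while Bunny is in
# connection recovery), TimerTask's timeout handling in concurrent-ruby
# 1.0.5 is where ruby-concurrency/concurrent-ruby#639 comes into play.
drain_task = Concurrent::TimerTask.new(
  execution_interval: 5,   # how often to run the drain block (seconds)
  timeout_interval:   10   # how long to let it run before timing out
) do
  drain_batch_from_rabbitmq # hypothetical: flush a batch of log parts
end

drain_task.execute
```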
After diving in, I think we may need to consider rewriting the loop in drain so it doesn't use `TimerTask` (this is a very old issue), as at least one other project did (ruby-shoryuken/shoryuken#338 and ruby-shoryuken/shoryuken#345), or fixing `TimerTask` in concurrent-ruby.
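As a very rough illustration of the shoryuken-style direction (a single long-lived thread with its own sleep loop instead of `TimerTask`), something like the hypothetical sketch below; class and method names are placeholders, not existing code:

```ruby
# Hypothetical spike, not the actual implementation: one dedicated
# thread that loops, drains, sleeps, and survives errors, so no
# per-tick scheduling machinery (and no TimerTask) is involved.
class DrainLoop
  def initialize(interval: 5)
    @interval = interval
    @shutdown = false
  end

  def start
    @thread = Thread.new do
      until @shutdown
        begin
          drain_batch_from_rabbitmq # placeholder for the real batch flush
        rescue => e
          # log and keep looping rather than letting the thread die
          warn "drain error: #{e.class}: #{e.message}"
        end
        sleep @interval
      end
    end
  end

  def stop
    @shutdown = true
    @thread&.join
  end

  private

  def drain_batch_from_rabbitmq
    # placeholder
  end
end
```

The upside of this shape is that the thread count stays fixed no matter how long an individual drain takes; the trade-off is that we would have to handle timeouts and shutdown ourselves.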
Here's some copied context from the private issue:
Left overnight with a higher execution timer and no connection recovery [in bunny], it looks like it still crashed (new pid) and is sitting at 60-some pids [instead of 30]. Looking at the threads, it looks like this:
Beginning to feel our only option may be rewriting this loop away from `TimerTask` as shoryuken did. I'll try to put together a spike for that today or tomorrow.
Profiling the top CPU user up there:
Summary of profiling data so far:
```
% self  % total  name
 95.34   100.00  <c function> - unknown
  0.49    71.44  synchronize - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/synchronization/mri_lockable_object.rb
  0.31    71.10  block in synchronize - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/synchronization/mri_lockable_object.rb
  0.21     5.32  safe_execute - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/ivar.rb
  0.19     4.33  execute_task - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/timer_task.rb
  0.16    66.80  block in process_tasks - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/executor/timer_set.rb
  0.16     0.61  block in initialize - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/scheduled_task.rb
  0.12     1.74  execute - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/scheduled_task.rb
  0.09    64.98  block in ns_wait_until - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/synchronization/abstract_lockable_object.rb
  0.09     0.95  format_l2met - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/bundler/gems/travis-logger-b589e0ca0e9f/lib/travis/logger/format.rb
  0.09     0.52  ordered? - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/collection/ruby_non_concurrent_priority_queue.rb
  0.09     0.38  schedule_time - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/scheduled_task.rb
  0.09     0.22  initialize - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/atomic/event.rb
  0.09     0.22  block (2 levels) in l2met_args_to_record - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/bundler/gems/travis-logger-b589e0ca0e9f/lib/travis/logger/format.rb
  0.07     1.54  new - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/synchronization/object.rb
  0.07     0.67  sink - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/collection/ruby_non_concurrent_priority_queue.rb
  0.07     0.52  block in post - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/executor/ruby_executor_service.rb
  0.07     0.18  block (3 levels) in <class:Logger> - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/bundler/gems/travis-logger-b589e0ca0e9f/lib/travis/logger.rb
  0.06    65.08  block in wait - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/atomic/event.rb
  0.06     4.45  block in execute - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/executor/safe_task_executor.rb
```
Initially investigated in https://github.com/travis-pro/travis-enterprise/issues/299 (apologies for the private link)