When `logs-drain` is left running long enough it begins to leak threads. Working with the customer and on our own test instance we came across ruby-concurrency/concurrent-ruby#639. My guess was that this had to do with the RabbitMQ reconnection taking longer than the timer task's execution and timeout intervals. Adjusting along those lines did seem to help initially, but we still had a problem. For now we have hot-patched the issue in 2.2.1 by restarting the service every 8 hours. Heroku does a courtesy restart every 24 hours, which is probably why we never saw this issue in hosted; busy servers also don't seem to hit it, which is why I originally thought it was a RabbitMQ heartbeat timeout issue (we used to have a thread leak around that too, fixed in #171).
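For context, here's a minimal sketch of the kind of `TimerTask` usage involved, with illustrative interval values and a hypothetical `drain_batch_from_rabbitmq` helper (this is not the actual travis-logs drain code):

```ruby
require 'concurrent'

# Illustrative intervals only; the real drain uses its own configuration.
# If the block runs longer than these intervals (e.g. while Bunny is in
# connection recovery), TimerTask's timeout handling in concurrent-ruby
# 1.0.5 is where ruby-concurrency/concurrent-ruby#639 comes into play.
drain_task = Concurrent::TimerTask.new(
  execution_interval: 5,   # how often to run the drain block (seconds)
  timeout_interval:   10   # how long to let it run before timing out
) do
  drain_batch_from_rabbitmq # hypothetical: flush a batch of log parts
end

drain_task.execute
```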
After diving in, I think we may need to consider rewriting the loop in drain so it doesn't use `TimerTask` (this is a very old issue), as at least one other project did (ruby-shoryuken/shoryuken#338 and ruby-shoryuken/shoryuken#345), or fixing `TimerTask` in concurrent-ruby.
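As a very rough illustration of the shoryuken-style direction (a single long-lived thread with its own sleep loop instead of `TimerTask`), something like the hypothetical sketch below; class and method names are placeholders, not existing code:

```ruby
# Hypothetical spike, not the actual implementation: one dedicated
# thread that loops, drains, sleeps, and survives errors, so no
# per-tick scheduling machinery (and no TimerTask) is involved.
class DrainLoop
  def initialize(interval: 5)
    @interval = interval
    @shutdown = false
  end

  def start
    @thread = Thread.new do
      until @shutdown
        begin
          drain_batch_from_rabbitmq # placeholder for the real batch flush
        rescue => e
          # log and keep looping rather than letting the thread die
          warn "drain error: #{e.class}: #{e.message}"
        end
        sleep @interval
      end
    end
  end

  def stop
    @shutdown = true
    @thread&.join
  end

  private

  def drain_batch_from_rabbitmq
    # placeholder
  end
end
```

The upside of this shape is that the thread count stays fixed no matter how long an individual drain takes; the trade-off is that we would have to handle timeouts and shutdown ourselves.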
Here's some copied context from the private issue:
Left overnight with a higher execution timer and no connection recovery [in bunny], it looks like it still crashed (new pid) and is sitting at 60-some pids [instead of 30]. Looking at the threads, it looks like this:
Beginning to feel our only option may be rewriting this loop away from `TimerTask` as shoryuken did. I'll try to put together a spike for that today or tomorrow.
Profiling the top CPU user up there:
Summary of profiling data so far:
```
% self  % total  name
 95.34   100.00  <c function> - unknown
  0.49    71.44  synchronize - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/synchronization/mri_lockable_object.rb
  0.31    71.10  block in synchronize - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/synchronization/mri_lockable_object.rb
  0.21     5.32  safe_execute - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/ivar.rb
  0.19     4.33  execute_task - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/timer_task.rb
  0.16    66.80  block in process_tasks - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/executor/timer_set.rb
  0.16     0.61  block in initialize - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/scheduled_task.rb
  0.12     1.74  execute - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/scheduled_task.rb
  0.09    64.98  block in ns_wait_until - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/synchronization/abstract_lockable_object.rb
  0.09     0.95  format_l2met - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/bundler/gems/travis-logger-b589e0ca0e9f/lib/travis/logger/format.rb
  0.09     0.52  ordered? - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/collection/ruby_non_concurrent_priority_queue.rb
  0.09     0.38  schedule_time - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/scheduled_task.rb
  0.09     0.22  initialize - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/atomic/event.rb
  0.09     0.22  block (2 levels) in l2met_args_to_record - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/bundler/gems/travis-logger-b589e0ca0e9f/lib/travis/logger/format.rb
  0.07     1.54  new - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/synchronization/object.rb
  0.07     0.67  sink - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/collection/ruby_non_concurrent_priority_queue.rb
  0.07     0.52  block in post - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/executor/ruby_executor_service.rb
  0.07     0.18  block (3 levels) in <class:Logger> - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/bundler/gems/travis-logger-b589e0ca0e9f/lib/travis/logger.rb
  0.06    65.08  block in wait - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/atomic/event.rb
  0.06     4.45  block in execute - /usr/local/travis-logs/vendor/bundle/ruby/2.4.0/gems/concurrent-ruby-1.0.5/lib/concurrent/executor/safe_task_executor.rb
```
Initially investigated in https://github.com/travis-pro/travis-enterprise/issues/299 (apologies for the private link)