-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Use separate dispatchers for MemoryQueue, QueueManager, and Akka heartbeat #5549
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
@@ -41,6 +41,26 @@ akka.http { | |||
preview.enable-http2 = on | |||
parsing.illegal-header-warnings = off | |||
} | |||
|
|||
cluster { | |||
use-dispatcher = "dispatchers.heartbeat-dispatcher" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assigned a separate dispatcher for the akka-cluster heartbeat.
@@ -72,7 +92,7 @@ kamon { | |||
service = "openwhisk-statsd" | |||
} | |||
metric { | |||
tick-interval = 1 second | |||
tick-interval = 10 second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is one of the main changes. According to my analysis, it seems there is a leak in the Kamon metric.
As a scheduler is running longer, there are a huge number of MetricSnapshot instances created.
Below is the heap dump for a scheduler when it faced a thread starvation.

There are 97M numbers of scala.collection.immutable.$colon$colon
and 96M numbers of kamon.metric.Instrument$Snapshot
.
The reference dominator of scala.collection.immutable.$colon$colon
is mostly MetricSnapshot
.

Also, kamon.metric.Instrument$Snapshot
is mostly referenced by scala.collection.immutable.$colon$colon
, in turn, it results in MetricSnapshot
.

All components other than MemoryQueue are emitting metrics with 10s intervals. So I updated the metric emission interval of MemoryQueue to 10s as well.
Since now we emit all metrics every 10s, we don't need to use a smaller tick interval like 1s because it will try to create a snapshot every 1 second, but the metric itself remains unchanged for 10s because we don't emit them in the middle of the interval(10s).
@@ -181,7 +182,7 @@ class MemoryQueue(private val etcdClient: EtcdClient, | |||
private[queue] var limit: Option[Int] = None | |||
private[queue] var initialized = false | |||
|
|||
private val logScheduler: Cancellable = context.system.scheduler.scheduleWithFixedDelay(0.seconds, 1.seconds) { () => | |||
private val logScheduler: Cancellable = context.system.scheduler.scheduleWithFixedDelay(0.seconds, 10.seconds) { () => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was emitting 5 metrics every 1 second. If there are 400 queues running, they will emit around 2000 metrics per second. Considering the fact that one memory queue will spawn multiple sub-actors, and the combination with the use of CachedThreadPool
, which spawns an unlimited number of actors on demand, it caused thread starvation.
@@ -0,0 +1,37 @@ | |||
dispatchers { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I introduced separate dispatchers to guarantee performance and minimize performance impact.
I used the fork-join-executor
as their jobs are mostly CPU-bound work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for the detailed comments explaining the PR.
This is to make the system more stable. There are generally many numbers of queues running in a scheduler.
One queue will spawn multiple actors, and there could be a huge number of actors.
There are some critical actors to maintain the sanity of the system like akka-heartbeat.
This change will ensure isolating the performance impact of memory queues and guaranteeing the akka-heartbeat is not being starved.
Description
Related issue and scope
My changes affect the following components
Types of changes
Checklist: