
Conversation

style95 (Member) commented Aug 29, 2025

This is to make the system more stable. A scheduler generally runs a large number of queues.
Each queue spawns multiple actors, so the total number of actors can become huge.
Some actors, such as the akka-cluster heartbeat, are critical to keeping the system healthy.
This change isolates the performance impact of the memory queues and guarantees that the akka heartbeat is never starved.

Description

Related issue and scope

  • I opened an issue to propose and discuss this change (#????)

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Scheduler
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

@@ -41,6 +41,26 @@ akka.http {
preview.enable-http2 = on
parsing.illegal-header-warnings = off
}

cluster {
use-dispatcher = "dispatchers.heartbeat-dispatcher"
style95 (Member, Author) commented:

I assigned a separate dispatcher for the akka-cluster heartbeat.

@@ -72,7 +92,7 @@ kamon {
service = "openwhisk-statsd"
}
metric {
tick-interval = 1 second
tick-interval = 10 second
style95 (Member, Author) commented:

This is one of the main changes. According to my analysis, there seems to be a leak in Kamon's metric handling:
the longer a scheduler runs, the more MetricSnapshot instances accumulate.

Below is a heap dump from a scheduler that hit thread starvation.

[screenshot: heap histogram]

There are 97 million instances of scala.collection.immutable.$colon$colon and 96 million instances of kamon.metric.Instrument$Snapshot.

The dominator of most scala.collection.immutable.$colon$colon instances is MetricSnapshot.

[screenshot: dominator tree for scala.collection.immutable.$colon$colon]

Likewise, kamon.metric.Instrument$Snapshot is mostly referenced by scala.collection.immutable.$colon$colon, which in turn leads back to MetricSnapshot.

[screenshot: dominator tree for kamon.metric.Instrument$Snapshot]

All components other than MemoryQueue already emit metrics at 10-second intervals, so I updated the MemoryQueue emission interval to 10 seconds as well.
Since all metrics are now emitted every 10 seconds, there is no need for a tick interval as small as 1 second: Kamon would create a snapshot every second even though the underlying values only change every 10 seconds, because nothing is emitted in the middle of that interval.
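To make the relationship concrete, here is a minimal sketch, assuming Kamon 2.x's gauge API; the metric name and value are hypothetical, and OpenWhisk actually goes through its own MetricEmitter wrapper rather than calling Kamon directly:

import com.typesafe.config.ConfigFactory
import kamon.Kamon

object KamonTickSketch extends App {
  // Kamon snapshots every instrument once per kamon.metric.tick-interval and hands the
  // snapshot to the configured reporter (statsd in our case). If the gauge values are
  // only refreshed every 10 seconds, a 1-second tick just produces ten near-identical
  // MetricSnapshots per real update, which matches the garbage seen in the heap dump above.
  Kamon.init(
    ConfigFactory
      .parseString("kamon.metric.tick-interval = 10 seconds")
      .withFallback(ConfigFactory.load()))

  // Hypothetical gauge standing in for one of the MemoryQueue metrics.
  val activationCount = Kamon.gauge("memory_queue_activation_count").withoutTags()
  activationCount.update(42) // refreshed every 10 s by the MemoryQueue's log scheduler
}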

@@ -181,7 +182,7 @@ class MemoryQueue(private val etcdClient: EtcdClient,
private[queue] var limit: Option[Int] = None
private[queue] var initialized = false

private val logScheduler: Cancellable = context.system.scheduler.scheduleWithFixedDelay(0.seconds, 1.seconds) { () =>
private val logScheduler: Cancellable = context.system.scheduler.scheduleWithFixedDelay(0.seconds, 10.seconds) { () =>
style95 (Member, Author) commented:

This was emitting 5 metrics every second. With 400 queues running, that is around 2,000 metrics per second. Combined with the fact that each memory queue spawns multiple sub-actors, and with the use of a CachedThreadPool, which creates an unbounded number of threads on demand, this caused thread starvation.
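For illustration, a rough sketch of the cadence change; the emit helper, gauge names, and queue name are hypothetical stand-ins for the five metrics a MemoryQueue actually reports:

import akka.actor.ActorSystem
import scala.concurrent.duration._

object EmissionCadenceSketch extends App {
  val system = ActorSystem("sketch")
  import system.dispatcher

  // Stand-in for the five gauges each MemoryQueue emits per run of its log scheduler.
  def emitGauges(queue: String): Unit =
    (1 to 5).foreach(i => println(s"emit $queue metric-$i"))

  // Before: 400 queues x 5 gauges every 1 s  ~= 2,000 metrics per second.
  // After:  400 queues x 5 gauges every 10 s ~=   200 metrics per second.
  system.scheduler.scheduleWithFixedDelay(0.seconds, 10.seconds) { () =>
    emitGauges("namespace/action-queue") // hypothetical queue name
  }
}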

@@ -0,0 +1,37 @@
dispatchers {
style95 (Member, Author) commented:

I introduced separate dispatchers to guarantee performance for critical actors and to minimize the performance impact of the other components on them.
I used the fork-join-executor since their jobs are mostly CPU-bound work.
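A minimal, self-contained sketch of how such a dispatcher is defined and attached to an actor. The executor settings and the Echo actor are illustrative only; the actual values in this file are elided from the diff, and only the id dispatchers.heartbeat-dispatcher appears in the earlier hunk:

import akka.actor.{Actor, ActorSystem, Props}
import com.typesafe.config.ConfigFactory

object DispatcherSketch extends App {
  // Illustrative fork-join dispatcher; the parallelism values are guesses, not the PR's.
  val config = ConfigFactory.parseString("""
    dispatchers {
      heartbeat-dispatcher {
        type = Dispatcher
        executor = "fork-join-executor"
        fork-join-executor {
          parallelism-min = 2
          parallelism-factor = 1.0
          parallelism-max = 4
        }
        throughput = 5
      }
    }
  """).withFallback(ConfigFactory.load())

  class Echo extends Actor {
    def receive = { case msg => sender() ! msg }
  }

  val system = ActorSystem("sketch", config)

  // Pinning an actor to its own dispatcher keeps it responsive even when the default
  // dispatcher is saturated; akka.cluster.use-dispatcher (earlier hunk) achieves the
  // same isolation for the cluster heartbeat.
  val echo = system.actorOf(Props(new Echo).withDispatcher("dispatchers.heartbeat-dispatcher"))
}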

dgrove-oss (Member) left a comment:

LGTM. Thanks for the detailed comments explaining the PR.
