Fix grace period after kill for health check failure #7221

komuta · 2020-09-22T07:32:10Z

Following the change of task id / instance id format, task health status
was not properly tracked because it stayed bound to the same instance id
after successive kills for health check failure. This fix proposes to
address the issue by tracking health status by task id, ensuring it is
cleaned each time a task is terminated, whatever the reason.

JIRA Issues: MARATHON-8745
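
As an illustration of the idea only, here is a minimal sketch and not the actual Marathon code. `TaskId`, `Health`, and `HealthTracker` are simplified stand-ins invented for this example, and the `incarnation` counter is an assumed way of telling successive tasks of one instance apart. Keying the map by task id means a replacement task never inherits the previous task's record, and pruning against the set of active task ids clears the entries of terminated tasks.

```scala
import scala.collection.concurrent.TrieMap

object HealthByTaskIdSketch {
  // Hypothetical stand-ins for Marathon's Task.Id and Health types.
  final case class TaskId(instanceId: String, incarnation: Int)
  final case class Health(taskId: TaskId, consecutiveFailures: Int = 0)

  final class HealthTracker {
    private val healthByTaskId = TrieMap.empty[TaskId, Health]

    // Read-then-write is not atomic; acceptable for a sketch that is
    // only ever driven from a single thread.
    def recordFailure(taskId: TaskId): Health = {
      val current = healthByTaskId.getOrElse(taskId, Health(taskId))
      val updated = current.copy(consecutiveFailures = current.consecutiveFailures + 1)
      healthByTaskId.update(taskId, updated)
      updated
    }

    // Drop every record whose task is no longer active.
    // filterInPlace is the Scala 2.13 replacement for the deprecated retain.
    def purge(activeTaskIds: Set[TaskId]): Unit =
      healthByTaskId.filterInPlace((taskId, _) => activeTaskIds(taskId))

    def get(taskId: TaskId): Option[Health] = healthByTaskId.get(taskId)

    def size: Int = healthByTaskId.size
  }
}
```

With this shape, a task restarted under the same instance id starts with no health record at all, so nothing from the killed task can leak into it.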

timcharper · 2020-10-01T22:11:17Z

src/main/scala/mesosphere/marathon/core/health/impl/HealthCheckActor.scala

@@ -40,7 +41,7 @@ private[health] class HealthCheckActor(
  implicit val mat = ActorMaterializer()
  import context.dispatcher

-  val healthByInstanceId = TrieMap.empty[Instance.Id, Health]
+  val healthByInstanceId = TrieMap.empty[Task.Id, Health]


Seems like we ought to rename this map, now that it is keyed by task id?

timcharper · 2020-10-01T22:12:01Z

src/main/scala/mesosphere/marathon/core/health/impl/HealthCheckActor.scala

-    inactiveInstanceIds.foreach { inactiveId =>
-      healthByInstanceId.remove(inactiveId)
-    }
+    val activeTaskIds: Set[Task.Id] = instances.map(_.appTask).filter(_.isActive).map(_.taskId).to(Set)


Suggested change

-    val activeTaskIds: Set[Task.Id] = instances.map(_.appTask).filter(_.isActive).map(_.taskId).to(Set)
+    val activeTaskIds: Set[Task.Id] = instances.iterator.map(_.appTask).filter(_.isActive).map(_.taskId).to(Set)
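
The review does not say why `.iterator` is preferred. Presumably the aim is to avoid building an intermediate collection at each `map` and `filter` step. A minimal sketch of the two forms follows, with `Task` and `Instance` as simplified stand-ins rather than Marathon's own classes:

```scala
object ActiveTaskIdsSketch {
  // Hypothetical, stripped-down versions of the types used in the diff.
  final case class Task(taskId: String, isActive: Boolean)
  final case class Instance(appTask: Task)

  // Strict pipeline: every map/filter step materialises a new Seq.
  def strict(instances: Seq[Instance]): Set[String] =
    instances.map(_.appTask).filter(_.isActive).map(_.taskId).to(Set)

  // Iterator pipeline: elements stream through the steps one at a time,
  // and only the final Set is allocated.
  def viaIterator(instances: Seq[Instance]): Set[String] =
    instances.iterator.map(_.appTask).filter(_.isActive).map(_.taskId).to(Set)
}
```

Both forms return the same set; the difference is only in the temporary allocations along the way.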

timcharper · 2020-10-01T22:13:11Z

src/main/scala/mesosphere/marathon/core/health/impl/HealthCheckActor.scala

-      healthByInstanceId.remove(inactiveId)
-    }
+    val activeTaskIds: Set[Task.Id] = instances.map(_.appTask).filter(_.isActive).map(_.taskId).to(Set)
+    healthByInstanceId.retain((taskId, health) => activeTaskIds(taskId))


timcharper · 2020-10-01T22:17:42Z

src/main/scala/mesosphere/marathon/core/health/impl/HealthCheckActor.scala

@@ -192,7 +191,9 @@ private[health] class HealthCheckActor(
  }

  def receive: Receive = {
-    case GetInstanceHealth(instanceId) => sender() ! healthByInstanceId.getOrElse(instanceId, Health(instanceId))
+    case GetInstanceHealth(instanceId) =>
+      sender() ! healthByInstanceId.find(_._1.instanceId == instanceId)


Hmm... so we're going to scan a map? It seems like this could perform poorly with lots of health checks. Maybe we should keep it indexed by instance id and evict the health status if the task id changes?
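
One way the alternative raised in this comment could look, as a minimal sketch under stated assumptions: `InstanceId`, `TaskId`, `Health`, and `HealthIndex` are hypothetical stand-ins, not Marathon's real types. The map stays keyed by instance id, so lookups remain direct, and each entry remembers which task produced it so that a different task id resets the record.

```scala
import scala.collection.concurrent.TrieMap

object HealthByInstanceIdSketch {
  // Hypothetical stand-ins for Marathon's Instance.Id, Task.Id and Health.
  final case class InstanceId(value: String)
  final case class TaskId(instanceId: InstanceId, incarnation: Int)
  final case class Health(instanceId: InstanceId, consecutiveFailures: Int = 0)

  final class HealthIndex {
    // Each entry stores the task id alongside the health it produced.
    private val healthByInstanceId = TrieMap.empty[InstanceId, (TaskId, Health)]

    def recordFailure(taskId: TaskId): Health = {
      val instanceId = taskId.instanceId
      val previous = healthByInstanceId.get(instanceId) match {
        // Same task as before: keep accumulating on its record.
        case Some((storedTaskId, health)) if storedTaskId == taskId => health
        // No entry yet, or the task id changed: evict and start fresh.
        case _ => Health(instanceId)
      }
      val updated = previous.copy(consecutiveFailures = previous.consecutiveFailures + 1)
      healthByInstanceId.update(instanceId, (taskId, updated))
      updated
    }

    // Direct map lookup by instance id, no scan over the entries.
    def get(instanceId: InstanceId): Health =
      healthByInstanceId.get(instanceId).map(_._2).getOrElse(Health(instanceId))
  }
}
```

The trade-off is that eviction happens lazily, on the next result for the instance, rather than when the old task terminates.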

timcharper · 2020-10-01T22:18:31Z

Thanks for the PR! I've reviewed it and left some feedback. I'm most concerned about changing the index and introducing a scan for what was previously a map lookup. Thank you!

rohitjain25

def update(result: HealthResult): Health =
  result match {
    case Healthy(_, _, time, _) =>
      copy(
        firstSuccess = firstSuccess.orElse(Some(time)),
        lastSuccess = Some(time),
        consecutiveFailures = 0
      )
    case Unhealthy(_, _, cause, time, _) =>
      copy(
        lastFailure = Some(time),
        lastFailureCause = Some(cause)
      )
  }

timcharper suggested changes Oct 1, 2020

rohitjain25 reviewed Jan 6, 2024
