-
Notifications
You must be signed in to change notification settings - Fork 843
Fix grace period after kill for health check failure #7221
base: master
Are you sure you want to change the base?
Conversation
Following the change of task id / instance id format, task health status was not properly tracked because bound to the same instance id after successive kills for health check failure. This fix propose to address the issue by tracking health status by task id to ensure it to cleaned each time a task is terminated, whatever the reason. JIRA Issues: MARATHON-8745
@@ -40,7 +41,7 @@ private[health] class HealthCheckActor( | |||
implicit val mat = ActorMaterializer() | |||
import context.dispatcher | |||
|
|||
val healthByInstanceId = TrieMap.empty[Instance.Id, Health] | |||
val healthByInstanceId = TrieMap.empty[Task.Id, Health] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like we oughta rename this map?
inactiveInstanceIds.foreach { inactiveId => | ||
healthByInstanceId.remove(inactiveId) | ||
} | ||
val activeTaskIds: Set[Task.Id] = instances.map(_.appTask).filter(_.isActive).map(_.taskId).to(Set) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
val activeTaskIds: Set[Task.Id] = instances.map(_.appTask).filter(_.isActive).map(_.taskId).to(Set) | |
val activeTaskIds: Set[Task.Id] = instances.iterator.map(_.appTask).filter(_.isActive).map(_.taskId).to(Set) |
healthByInstanceId.remove(inactiveId) | ||
} | ||
val activeTaskIds: Set[Task.Id] = instances.map(_.appTask).filter(_.isActive).map(_.taskId).to(Set) | ||
healthByInstanceId.retain((taskId, health) => activeTaskIds(taskId)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👏
@@ -192,7 +191,9 @@ private[health] class HealthCheckActor( | |||
} | |||
|
|||
def receive: Receive = { | |||
case GetInstanceHealth(instanceId) => sender() ! healthByInstanceId.getOrElse(instanceId, Health(instanceId)) | |||
case GetInstanceHealth(instanceId) => | |||
sender() ! healthByInstanceId.find(_._1.instanceId == instanceId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... so we're going to scan a map? This seems like this could potentially perform poorly with lots of health checks. Maybe we should keep it indexed by instance id and evict the health status if the task id changes?
Thanks for the PR! I've reviewed it and left some feedback, I'm most concerned about changing the index and introducing a scan for what was previously a map lookup. Thank you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def update(result: HealthResult): Health =
result match {
case Healthy(_, _, time, _) =>
copy(
firstSuccess = firstSuccess.orElse(Some(time)),
lastSuccess = Some(time),
consecutiveFailures = 0
)
case Unhealthy(_, _, cause, time, _) =>
copy(
lastFailure = Some(time),
lastFailureCause = Some(cause)
)
}
Following the change of task id / instance id format, task health status
was not properly tracked because bound to the same instance id after
successive kills for health check failure. This fix propose to address
the issue by tracking health status by task id to ensure it to cleaned
each time a task is terminated, whatever the reason.
JIRA Issues: MARATHON-8745