Today, we use a rather blunt means for restarting failed shards (every five minutes, we un-assign any shard which is currently FAILED).
This is simultaneously too long a delay for a first flaky failure of an otherwise healthy task, and too short a delay for a task which only ever errors. We'd like to instead track failures of a task over time and apply a more graduated back-off if it continues to fail after successive restarts, likely ultimately disabling the task automatically after sustained failure.
We also now have multiple data-planes, and we want a consolidated mechanism for managing shard failures across all data-planes.
The runtime invokes a new /notify/shard-failure control-plane API which
is told of shard failures that have occurred within a data-plane.
At the moment, this API verifies the data-plane token and logs the
failure, but takes no further action.
Update the taskBase.heartbeatLoop() to perform this notification if the
shard's primary loop exits with a non-cancellation error.
Issue #1666