Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle system time move backward case in timer task processing #7030

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

yycptt
Copy link
Member

@yycptt yycptt commented Dec 21, 2024

What changed?

  • Handle system time move backward case in timer task processing by using max(now(), task visibility timestamp)

Why?

  • Monotonic time is used in time.Time comparison only when both operands have monotonic time value. In our case, the timestamp stored/derived in/from mutable state doesn't have monotonic, thus the comparison logic will use wall clock time to decide if a timer task should be processed. If wall clock move backwards (after we verifies that now() > task visibility timestamp when submitting the task to the task scheduler), the timer task will be dropped and cause workflow to stuck.

How did you test it?

  • Unit tests

Potential risks

  • In worst case, we will execute a timer task logic earlier than expected.

Documentation

Is hotfix candidate?

  • Could be. But the bug should be very rare as the use millisecond precision when doing time comparison in timer task processing.

@yycptt yycptt requested a review from bergundy December 21, 2024 02:22
@yycptt yycptt requested a review from a team as a code owner December 21, 2024 02:22
@yycptt yycptt force-pushed the fix-queues-is-time-expired branch from fb5f5fd to c3f87c1 Compare December 21, 2024 02:24
service/history/queues/queue_scheduled.go Show resolved Hide resolved
// NOTE: Persistence layer may lose precision when persisting the task, which essentially moves
// task fire time backward. But we are already performing truncation here, so doesn't need to
// account for that.
referenceTime = util.MaxTime(referenceTime, task.GetKey().FireTime).Truncate(persistence.ScheduledTaskMinPrecision)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference between visibility timestamp and fire time and why take that here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are the same for Scheduled tasks. For immediate tasks, fire time is always 0. Since this function is only meant to be invoked by scheduled tasks, I think FireTime is better as it won't have any impact on immediate tasks.

@@ -236,7 +240,7 @@ func (t *timerQueueStandbyTaskExecutor) executeActivityTimeoutTask(
// created.
isHeartBeatTask := timerTask.TimeoutType == enumspb.TIMEOUT_TYPE_HEARTBEAT
ai, heartbeatTimeoutVis, ok := mutableState.GetActivityInfoWithTimerHeartbeat(timerTask.EventID)
if isHeartBeatTask && ok && queues.IsTimeExpired(timerTask.GetVisibilityTime(), heartbeatTimeoutVis) {
if isHeartBeatTask && ok && queues.IsTimeExpired(timerTask, timerTask.GetVisibilityTime(), heartbeatTimeoutVis) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we use referenceTime to allow progressing beyond the visibility timestamp in case task processing is delayed for a while?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal here I believe is to clear the heartbeat timer task status only when we have to. Using referenceTime means we will clear the flag more often and causing more heartbeat timer task being created.

@yycptt yycptt requested a review from bergundy January 6, 2025 18:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants