
[Bug] Workflows getting stuck after cancellation (spurious WorkflowTaskFailed errors) #1580

Open
tetrakatech opened this issue Dec 4, 2024 · 0 comments
Labels
bug Something isn't working

What are you really trying to do?

We have workflows that are routinely cancelled by other (orchestrator) workflows. Usually this works perfectly and lets our system implement preemption in line with our business requirements.

Sporadically - a few times a day, out of thousands of workflows - a child workflow fails to cancel correctly and instead stalls in a Running state, leading to effectively infinite workflow task retries that block the orchestrator workflow.
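For context, the preemption semantics we rely on can be sketched in plain TypeScript. All names here are hypothetical, and `AbortController` merely stands in for Temporal's workflow cancellation request; this is not Temporal API:

```ts
// Plain-TypeScript analogue of our preemption pattern: each new message
// cancels the in-flight handler and starts a fresh one. AbortController
// stands in for Temporal's cancellation request (NOT the Temporal API).
class Preemptor {
  private current?: AbortController;

  // Start handling a new message, cancelling any in-flight handler first.
  handle<T>(fn: (signal: AbortSignal) => Promise<T>): Promise<T> {
    this.current?.abort(); // preempt the previous handler
    const controller = new AbortController();
    this.current = controller;
    return fn(controller.signal);
  }
}
```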

Describe the bug

Typically, when cancellation occurs we see a WorkflowTaskCompleted event in the Temporal UI, followed by a WorkflowExecutionCanceled event.


In the buggy cases, we instead see two WorkflowTaskFailed events, each with a call stack showing a CancelledFailure thrown inside the Temporal SDK, and the workflow is left stuck in a Running state.

Logs for one of the buggy runs:

```
2024-11-28 00:54:22.293 Workflow started
2024-11-28 00:54:22.351 Activity started
2024-11-28 00:54:23.030 Activity completed
2024-11-28 00:54:23.700 Workflow completed as cancelled
2024-11-28 00:54:23.700 Failing workflow task
2024-11-28 00:54:24.105 Workflow completed as cancelled
2024-11-28 00:54:24.108 Failing workflow task
2024-11-28 00:54:24.193 {"component":"matching-engine", "error":"Workflow task not found.", "level":"info", "logging-call-at":"matching_engine.go:613", "msg":"Workflow task not found", "queue-task-id":-137, "queue-task-visibility-timestamp":"2024-11-27T13:54:24.180Z", "ts":"2024-11-27T13:54:24.193Z", "wf-history-event-id":16, "wf-id":"9ee39789-3ca2-4fc3-af5c-8bd24a7bbe28-1-378bf7e7-bc20-4b26-b459-c9e7633aa5c6", "wf-namespace-id":"5e9a725c-d677-4281-8b07-9cd4097060bd", "wf-run-id":"dd593d4f-3f24-4177-b376-ba022ebb5d10", "wf-task-queue-name":"chatExecutor"}
2024-11-28 00:54:24.290999889 Workflow completed as cancelled
2024-11-28 00:54:34.270999908 Workflow completed as cancelled
2024-11-28 00:54:44.298000097 Workflow completed as cancelled
2024-11-28 00:54:59.002000093 Workflow completed as cancelled
2024-11-28 00:55:18.104000091 Workflow completed as cancelled
2024-11-28 00:55:46.848999977 Workflow completed as cancelled
2024-11-28 00:56:30.530999898 Workflow completed as cancelled
2024-11-28 00:57:55.980000019 Workflow completed as cancelled
2024-11-28 01:00:26.433630347 {"address":"10.20.12.44:7234", "attempt":10, "level":"warn", "logging-call-at":"workflow_task_state_machine.go:1061", "msg":"Critical attempts processing workflow task", "service":"history", "shard-id":130, "ts":"2024-11-27T14:00:26.433Z", "wf-id":"9ee39789-3ca2-4fc3-af5c-8bd24a7bbe28-1-378bf7e7-bc20-4b26-b459-c9e7633aa5c6", "wf-namespace":"default", "wf-run-id":"dd593d4f-3f24-4177-b376-ba022ebb5d10"}
2024-11-28 01:00:26.480000019 Workflow completed as cancelled
2024-11-28 01:05:07.210610845 {"address":"10.20.12.44:7234", "attempt":11, "level":"warn", "logging-call-at":"workflow_task_state_machine.go:1061", "msg":"Critical attempts processing workflow task", "service":"history", "shard-id":130, "ts":"2024-11-27T14:05:07.209Z", "wf-id":"9ee39789-3ca2-4fc3-af5c-8bd24a7bbe28-1-378bf7e7-bc20-4b26-b459-c9e7633aa5c6", "wf-namespace":"default", "wf-run-id":"dd593d4f-3f24-4177-b376-ba022ebb5d10"}
```

followed by further retries at progressively longer backoff intervals.
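The retry spacing in the log timestamps is consistent with an exponential workflow task retry backoff. A minimal sketch of that kind of schedule, where the constants are illustrative assumptions and not the server's actual retry policy:

```ts
// Illustrative exponential backoff schedule. The constants here are
// assumptions for illustration, not Temporal server's actual retry policy.
function backoffIntervalMs(
  attempt: number, // 1-based retry attempt
  initialMs = 10_000,
  coefficient = 1.5,
  maxMs = 300_000,
): number {
  // Interval grows geometrically with each attempt and is capped at maxMs.
  return Math.min(initialMs * coefficient ** (attempt - 1), maxMs);
}
```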

Minimal Reproduction

I've attached a (redacted) copy of the JSON events for one of the buggy cases. I'm hoping this is sufficient to repro, since the issue repeats on workflow re-attempts, and the example history is quite short (< 20 events).
dd593d4f-3f24-4177-b376-ba022ebb5d10_events.json

The workflow implementation that hits this case is trivially simple:

```ts
import {
  ActivityCancellationType,
  isCancellation,
  ParentClosePolicy,
  proxyActivities,
  startChild,
  workflowInfo,
} from '@temporalio/workflow'
import { TerminatedFailure } from '@temporalio/common'

const { handleCustomerMessage } =
  proxyActivities<IHandleCustomerMessageActivity>({
    startToCloseTimeout: '1 minute',
    retry: {
      maximumAttempts: 3,
    },
    cancellationType: ActivityCancellationType.TRY_CANCEL,
  })

export async function processChatTicket({
  data,
}: {
  data: ProcessTicketInput
}): Promise<ProcessTicketOutput> {
  try {
    const temporalWorkflowId = workflowInfo().workflowId

    const handleCustomerMessageOutput = await handleCustomerMessage({
      temporalWorkflowId,
      data,
    })

    // More logic below...
  } catch (error) {
    // If the error is just a cancellation or a termination, re-throw
    // immediately. The former is expected (e.g. a parent will cancel this
    // workflow if a new message is received); the latter is triggered manually
    // in the UI by a human in very limited circumstances.
    if (isCancellation(error) || error instanceof TerminatedFailure) {
      throw error
    }

    // Otherwise, kick off a "cleanup" workflow
    await startChild(cleanup, {
      // Other arguments
      parentClosePolicy: ParentClosePolicy.PARENT_CLOSE_POLICY_ABANDON,
    })

    throw error
  }
}
```

Environment/Versions

  • OS and processor: Kubernetes running node:lts-bookworm-slim (workers)
  • Temporal Version: Kubernetes running temporalio/server:1.24.1 (server), TypeScript SDK 1.11.3 (workers)
  • Are you using Docker or Kubernetes or building Temporal from source? Kubernetes

Additional context

Based on the log lines, it looks to me like the TypeScript and Core SDKs disagree about whether the thrown cancellation error represents a workflow task failure or not.

Notably, the TypeScript-side log line for workflow cancellation is emitted in both the buggy and the working cases:

```ts
log.debug('Workflow completed as cancelled', { sdkComponent: SdkComponent.worker });
```

But the log line for the workflow task failure comes from Core, https://github.com/temporalio/sdk-core/blob/4a2368d19f57e971ca9b2465f1dbeede7a861c34/core/src/worker/workflow/mod.rs#L423, and is only emitted in the buggy case.

With my limited knowledge of how this code works: is it possible that this.cancelled (https://github.com/temporalio/sdk-typescript/blob/main/packages/workflow/src/internals.ts#L945) isn't being set correctly in all cases, particularly when an Activity runs and completes successfully in parallel with an ActivityTaskCancelRequested event being generated? If so, I think the TypeScript SDK would report a workflow task failure on cancellation rather than cancelling cleanly, while still emitting the logs we're seeing.

Happy to provide any other details if they'd be useful :)
