You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
We have workflows that are routinely cancelled by other (orchestrator) workflows. Usually this works perfectly, and allows our system to implement preemption in line with our business requirements.
Sporadically - a few times a day, out of at least thousands of workflows - the child workflow is failing to cancel correctly, and is instead stalling in a running state, leading to ~infinite retries and blocking of the orchestrator workflow.
Describe the bug
Typically, when cancellation occurs we see a WorkflowTaskCompleted event in the Temporal UI, followed by a WorkflowExecutionCanceledEvent.
In the buggy cases, we instead see two instances of WorkflowTaskFailed, with the call stack being a CancelledFailure inside the Temporal SDK, and the workflow left stuck in a Running state:
Logs for a buggy case run ID:
2024-11-28 00:54:22.293 Workflow started
2024-11-28 00:54:22.351 Activity started
2024-11-28 00:54:23.030 Activity completed
2024-11-28 00:54:23.700 Workflow completed as cancelled
2024-11-28 00:54:23.700 Failing workflow task
2024-11-28 00:54:24.105 Workflow completed as cancelled
2024-11-28 00:54:24.108 Failing workflow task
2024-11-28 00:54:24.193 {"component":"matching-engine", "error":"Workflow task not found.", "level":"info", "logging-call-at":"matching_engine.go:613", "msg":"Workflow task not found", "queue-task-id":-137, "queue-task-visibility-timestamp":"2024-11-27T13:54:24.180Z", "ts":"2024-11-27T13:54:24.193Z", "wf-history-event-id":16, "wf-id":"9ee39789-3ca2-4fc3-af5c-8bd24a7bbe28-1-378bf7e7-bc20-4b26-b459-c9e7633aa5c6", "wf-namespace-id":"5e9a725c-d677-4281-8b07-9cd4097060bd", "wf-run-id":"dd593d4f-3f24-4177-b376-ba022ebb5d10", "wf-task-queue-name":"chatExecutor"}
2024-11-28 00:54:24.290999889 Workflow completed as cancelled
2024-11-28 00:54:34.270999908 Workflow completed as cancelled
2024-11-28 00:54:44.298000097 Workflow completed as cancelled
2024-11-28 00:54:59.002000093 Workflow completed as cancelled
2024-11-28 00:55:18.104000091 Workflow completed as cancelled
2024-11-28 00:55:46.848999977 Workflow completed as cancelled
2024-11-28 00:56:30.530999898 Workflow completed as cancelled
2024-11-28 00:57:55.980000019 Workflow completed as cancelled
2024-11-28 01:00:26.433630347 {"address":"10.20.12.44:7234", "attempt":10, "level":"warn", "logging-call-at":"workflow_task_state_machine.go:1061", "msg":"Critical attempts processing workflow task", "service":"history", "shard-id":130, "ts":"2024-11-27T14:00:26.433Z", "wf-id":"9ee39789-3ca2-4fc3-af5c-8bd24a7bbe28-1-378bf7e7-bc20-4b26-b459-c9e7633aa5c6", "wf-namespace":"default", "wf-run-id":"dd593d4f-3f24-4177-b376-ba022ebb5d10"}
2024-11-28 01:00:26.480000019 Workflow completed as cancelled
2024-11-28 01:05:07.210610845 {"address":"10.20.12.44:7234", "attempt":11, "level":"warn", "logging-call-at":"workflow_task_state_machine.go:1061", "msg":"Critical attempts processing workflow task", "service":"history", "shard-id":130, "ts":"2024-11-27T14:05:07.209Z", "wf-id":"9ee39789-3ca2-4fc3-af5c-8bd24a7bbe28-1-378bf7e7-bc20-4b26-b459-c9e7633aa5c6", "wf-namespace":"default", "wf-run-id":"dd593d4f-3f24-4177-b376-ba022ebb5d10"}
and then progressively more retries that are backed off.
Minimal Reproduction
I've attached a (redacted) copy of the JSON events for one of the buggy cases. I'm hoping this is sufficient to repro, since the issue repeats on workflow re-attempts, and the example history is quite short (< 20 events). dd593d4f-3f24-4177-b376-ba022ebb5d10_events.json
The workflow implementation to hit this case is trivially simple:
const{ handleCustomerMessage }=proxyActivities<IHandleCustomerMessageActivity>({startToCloseTimeout: '1 minute',retry: {maximumAttempts: 3,},cancellationType: ActivityCancellationType.TRY_CANCEL,})exportasyncfunctionprocessChatTicket({
data,}: {data: ProcessTicketInput}): Promise<ProcessTicketOutput>{try{consttemporalWorkflowId=workflowInfo().workflowIdconsthandleCustomerMessageOutput=awaithandleCustomerMessage({
temporalWorkflowId,
data,})// More logic below...}catch(error){// If the error is just a cancellation, or a termination, just re-throw// immediately. The former is expected (e.g. a parent will cancel this// workflow if a new message is received; the latter is triggered manually// in the UI by a human in very limited circumstances)if(isCancellation(error)||errorinstanceofTerminatedFailure){throwerror}// Otherwise, kick off a "cleanup" workflowawaitstartChild(cleanup,{// Other argumentsparentClosePolicy: ParentClosePolicy.PARENT_CLOSE_POLICY_ABANDON,})throwerror}}
Environment/Versions
OS and processor: Kubernetes running node:lts-bookworm-slim (workers)
Are you using Docker or Kubernetes or building Temporal from source? Kubernetes
Additional context
Based on the log lines, it looks to me like something is getting confused between the Typescript and Core SDKs as to whether the cancellation error thrown represents a WorkflowTask failure or not.
Notably, the logs around the task cancellation in Typescript are run in both buggy/working cases:
With my limited knowledge of how this code works, is it possible that this.cancelled (https://github.com/temporalio/sdk-typescript/blob/main/packages/workflow/src/internals.ts#L945) isn't being set correctly in all cases, especially when an Activity runs and successfully completes in parallel with a ActivityTaskCancelRequested event being generated? If that were the case, the Typescript SDK would I think indicate a workflow failure on cancellation, rather than cancelling cleanly, but would still emit the logs we're seeing.
Happy to provide any other details if they'd be useful :)
The text was updated successfully, but these errors were encountered:
What are you really trying to do?
We have workflows that are routinely cancelled by other (orchestrator) workflows. Usually this works perfectly, and allows our system to implement preemption in line with our business requirements.
Sporadically - a few times a day, out of at least thousands of workflows - the child workflow is failing to cancel correctly, and is instead stalling in a running state, leading to ~infinite retries and blocking of the orchestrator workflow.
Describe the bug
Typically, when cancellation occurs we see a WorkflowTaskCompleted event in the Temporal UI, followed by a WorkflowExecutionCanceledEvent.
In the buggy cases, we instead see two instances of WorkflowTaskFailed, with the call stack being a
CancelledFailure
inside the Temporal SDK, and the workflow left stuck in a Running state:Logs for a buggy case run ID:
and then progressively more retries that are backed off.
Minimal Reproduction
I've attached a (redacted) copy of the JSON events for one of the buggy cases. I'm hoping this is sufficient to repro, since the issue repeats on workflow re-attempts, and the example history is quite short (< 20 events).
dd593d4f-3f24-4177-b376-ba022ebb5d10_events.json
The workflow implementation to hit this case is trivially simple:
Environment/Versions
node:lts-bookworm-slim
(workers)temporalio/server:1.24.1
(server), Typescript SDK1.11.3
(workers)Additional context
Based on the log lines, it looks to me like something is getting confused between the Typescript and Core SDKs as to whether the cancellation error thrown represents a WorkflowTask failure or not.
Notably, the logs around the task cancellation in Typescript are run in both buggy/working cases:
sdk-typescript/packages/workflow/src/logs.ts
Line 99 in be37851
But the log around the workflow task failing is over here, https://github.com/temporalio/sdk-core/blob/4a2368d19f57e971ca9b2465f1dbeede7a861c34/core/src/worker/workflow/mod.rs#L423, and is only called in the buggy case.
With my limited knowledge of how this code works, is it possible that
this.cancelled
(https://github.com/temporalio/sdk-typescript/blob/main/packages/workflow/src/internals.ts#L945) isn't being set correctly in all cases, especially when an Activity runs and successfully completes in parallel with a ActivityTaskCancelRequested event being generated? If that were the case, the Typescript SDK would I think indicate a workflow failure on cancellation, rather than cancelling cleanly, but would still emit the logs we're seeing.Happy to provide any other details if they'd be useful :)
The text was updated successfully, but these errors were encountered: