This repository has been archived by the owner on Dec 13, 2023. It is now read-only.

HTTP task stuck in SCHEDULED state #3719

Closed
anjkl opened this issue Aug 4, 2023 · 5 comments
Labels
type: bug (bugs / bug fixes)

Comments

@anjkl

anjkl commented Aug 4, 2023

Describe the bug
The HTTP task gets stuck in the SCHEDULED state when external payload storage (`external.payload.storage`) is not configured and the payload size falls between the soft barrier and the hard barrier.

Logs related to the issue:

2023-08-04 08:39:56 DEBUG AsyncSystemTaskExecutor: - Task: TaskModel{taskType='HTTP', status=SCHEDULED, inputData={asyncComplete=false, http_request={method=GET, uri=http://server:8080/api/test, accept=application/json}}, referenceTaskName='get_test_info', retryCount=2, seq=3, correlationId='10234113-4020-4a89-b4fd-a63ae943b203', pollCount=0, taskDefName='get_test_info', scheduledTime=1691138396532, startTime=0, endTime=0, updateTime=1691138396485, startDelayInSeconds=0, retriedTaskId='48c8d4bb-001a-4c4b-8248-2738f9a11e18', retried=false, executed=false, callbackFromWorker=true, responseTimeoutSeconds=0, workflowInstanceId='11e5b8a8-eaff-4fed-9854-db1b2e702044', workflowType='generic_workflow', taskId='d2fe143a-fc8b-49ba-977c-3d5172dfdf9b', reasonForIncompletion='null', callbackAfterSeconds=0, workerId='null', outputData={}, workflowTask=get_test_info/get_test_info, domain='null', waitTimeout='0', inputMessage=null, outputMessage=null, rateLimitPerFrequency=0, rateLimitFrequencyInSeconds=0, externalInputPayloadStoragePath='null', externalOutputPayloadStoragePath='null', workflowPriority=0, executionNameSpace='null', isolationGroupId='null', iteration=0, subWorkflowId='null', subworkflowChanged=false} fetched from execution DAO for taskId: d2fe143a-fc8b-49ba-977c-3d5172dfdf9b
2023-08-04 08:39:56 DEBUG AsyncSystemTaskExecutor: - Executing HTTP/d2fe143a-fc8b-49ba-977c-3d5172dfdf9b in SCHEDULED state
...
...
2023-08-04 08:39:58 DEBUG AsyncSystemTaskExecutor: - Finished execution of HTTP/d2fe143a-fc8b-49ba-977c-3d5172dfdf9b-COMPLETED
2023-08-04 08:39:58 DEBUG SystemTaskWorker: - Polling queue:START_WORKFLOW, got 0 tasks
2023-08-04 08:39:59 ERROR ExternalPayloadStorageUtils: - Unable to upload payload to external storage for workflow: 11e5b8a8-eaff-4fed-9854-db1b2e702044
java.lang.NullPointerException: null
        at com.netflix.conductor.core.utils.ExternalPayloadStorageUtils.uploadHelper(ExternalPayloadStorageUtils.java:210) ~[conductor-core-3.13.7.jar!/:3.13.7]
        at com.netflix.conductor.core.utils.ExternalPayloadStorageUtils.verifyAndUpload(ExternalPayloadStorageUtils.java:164) ~[conductor-core-3.13.7.jar!/:3.13.7]
        at com.netflix.conductor.core.dal.ExecutionDAOFacade.externalizeTaskData(ExecutionDAOFacade.java:271) ~[conductor-core-3.13.7.jar!/:3.13.7]
        at com.netflix.conductor.core.dal.ExecutionDAOFacade.updateTask(ExecutionDAOFacade.java:504) ~[conductor-core-3.13.7.jar!/:3.13.7]
        at com.netflix.conductor.core.execution.AsyncSystemTaskExecutor.execute(AsyncSystemTaskExecutor.java:190) ~[conductor-core-3.13.7.jar!/:3.13.7]
        at com.netflix.conductor.core.execution.tasks.SystemTaskWorker.lambda$pollAndExecute$1(SystemTaskWorker.java:135) ~[conductor-core-3.13.7.jar!/:3.13.7]
        at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:829) ~[?:?]
2023-08-04 08:39:59 DEBUG SystemTaskWorker: - Polling queue: START_WORKFLOW with 4 slots acquired
2023-08-04 08:39:59 DEBUG SystemTaskWorker: - Polling queue:START_WORKFLOW, got 0 tasks
2023-08-04 08:39:59 DEBUG SystemTaskWorker: - Polling queue: KAFKA_PUBLISH with 4 slots acquired
2023-08-04 08:39:59 DEBUG SystemTaskWorker: - Polling queue:KAFKA_PUBLISH, got 0 tasks
...
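The stack trace suggests the following shape for the barrier check. This is a minimal illustrative sketch with hypothetical names (the real logic lives in `ExternalPayloadStorageUtils.verifyAndUpload`), not Conductor's actual code:

```python
# Hypothetical sketch of the payload-barrier logic described in this issue.
def verify_and_upload(payload_size_kb, soft_kb, hard_kb, external_storage):
    if payload_size_kb > hard_kb:
        # Above the hard barrier: the task is failed outright.
        return "TASK_FAILED"
    if payload_size_kb > soft_kb:
        # Between the barriers: Conductor tries to externalize the payload.
        # With no external.payload.storage configured, the storage reference
        # is null -- the NullPointerException seen in the logs above.
        if external_storage is None:
            raise RuntimeError("NPE: no external payload storage configured")
        return external_storage.upload(payload_size_kb)
    # Below the soft barrier: the payload is kept inline.
    return "INLINE"
```

With no storage configured, only payloads that land strictly between the two barriers hit the failing branch, which matches the repro below.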

Details
Conductor version: 3.13.7
Persistence implementation: MySQL
Queue implementation: MySQL
Lock: no

To Reproduce
Steps to reproduce the behavior:

  • Start an HTTP server that returns a large response. For example:
cd $(mktemp -d); python3 -c 'print("a" * 5 * 1024 * 1024)' > 1; python3 -m http.server 8888
  • Create a simple workflow with a single HTTP task
{
  "accessPolicy": {},
  "name": "stuck_http_test",
  "description": "test",
  "version": 1,
  "tasks": [
    {
      "name": "get_test_info",
      "taskReferenceName": "get_test_info",
      "inputParameters": {
        "http_request": {
          "uri": "http://localhost:8888/1",
          "method": "GET",
          "accept": "application/json"
        },
        "asyncComplete": false
      },
      "type": "HTTP",
      "startDelay": 0,
      "optional": true,
      "asyncComplete": false
    }
  ],
  "inputParameters": [],
  "outputParameters": {},
  "schemaVersion": 2,
  "restartable": true,
  "workflowStatusListenerEnabled": false,
  "ownerEmail": "[email protected]",
  "timeoutPolicy": "ALERT_ONLY",
  "timeoutSeconds": 0,
  "variables": {},
  "inputTemplate": {}
}
  • Run workflow
curl 'http://localhost:8080/api/workflow/stuck_http_test' -X POST -H 'Content-Type: application/json' --data-raw '{}'
  • Check the workflow status and the Conductor logs: the task is stuck, and the logs show a NullPointerException from ExternalPayloadStorageUtils.
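A quick sanity check of why this repro lands in the problematic window, assuming Conductor's default task output thresholds (3072 KB soft, 10240 KB hard; verify against your own configuration):

```python
# The repro server returns 5 * 1024 * 1024 "a" characters, i.e. 5120 KB.
response_kb = 5 * 1024 * 1024 / 1024   # 5120.0 KB
soft_kb, hard_kb = 3072, 10240          # assumed default thresholds, in KB

# The payload lands strictly between the soft and hard barriers, which is
# exactly the window that triggers the external-storage upload attempt.
assert soft_kb < response_kb < hard_kb
print(f"{response_kb:.0f} KB is between {soft_kb} KB and {hard_kb} KB")
```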

Expected behavior
The HTTP task should fail instead of getting stuck.

Probably related to #1153

@anjkl added the type: bug (bugs / bug fixes) label Aug 4, 2023
@anjkl
Author

anjkl commented Aug 7, 2023

A few additional notes: a similar issue can be reproduced with other task types, such as the INLINE task.

With an INLINE task the symptoms can differ, but the root cause is the same (a NullPointerException while uploading the payload to external storage):

  • the task gets stuck in the IN_PROGRESS state
  • the workflow start fails with an error:
{
  "status": 500,
  "message": "Unable to upload payload to external storage for workflow: 2031725a-b1ff-4794-828f-7f7de8703b84",
  "instance": "laptop",
  "retryable": false
}

Examples of INLINE task workflows that reproduce the same bug with different results:

  • the workflow start fails with a 500 error:
{
  "accessPolicy": {},
  "name": "test_workflow_execution_error",
  "version": 1,
  "tasks": [
    {
      "name": "test",
      "taskReferenceName": "test",
      "inputParameters": {
        "evaluatorType": "javascript",
        "expression": "'a'.repeat(${workflow.input.repeats});"
      },
      "type": "INLINE",
      "startDelay": 0,
      "optional": true,
      "asyncComplete": false
    }
  ],
  "inputParameters": [
    "repeats"
  ],
  "outputParameters": {},
  "schemaVersion": 2,
  "ownerEmail": "[email protected]",
  "timeoutPolicy": "ALERT_ONLY",
  "timeoutSeconds": 0
}
  • the task gets stuck in the IN_PROGRESS state:
{
  "accessPolicy": {},
  "name": "test_inline_task_stuck_in_progress",
  "version": 1,
  "tasks": [
    {
      "name": "wait",
      "taskReferenceName": "wait",
      "inputParameters": {
        "duration": "30s"
      },
      "type": "WAIT",
      "startDelay": 0,
      "optional": false,
      "asyncComplete": false
    },
    {
      "name": "test",
      "taskReferenceName": "test",
      "inputParameters": {
        "evaluatorType": "javascript",
        "expression": "'a'.repeat(${workflow.input.repeats});"
      },
      "type": "INLINE",
      "startDelay": 0,
      "optional": true,
      "asyncComplete": false
    }
  ],
  "inputParameters": [
    "repeats"
  ],
  "outputParameters": {},
  "schemaVersion": 2,
  "ownerEmail": "[email protected]",
  "timeoutPolicy": "ALERT_ONLY",
  "timeoutSeconds": 0
}

To trigger the issue, run the workflow with this input:

{
  "repeats": 7096000
}

@anjkl
Author

anjkl commented Aug 7, 2023

A workaround is to set the soft and hard barriers to the same value:

conductor:
  app:
    workflowInputPayloadSizeThreshold: 5120
    maxWorkflowInputPayloadSizeThreshold: 5120
    workflowOutputPayloadSizeThreshold: 5120
    maxWorkflowOutputPayloadSizeThreshold: 5120
    taskInputPayloadSizeThreshold: 3072
    maxTaskInputPayloadSizeThreshold: 3072
    taskOutputPayloadSizeThreshold: 3072
    maxTaskOutputPayloadSizeThreshold: 3072
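Why this workaround helps, under the assumption (from the behavior described above) that uploads are attempted only for payloads strictly between the soft and hard barriers: when the two values are equal, that window is empty, so an oversized payload is simply failed rather than externalized. A small illustrative check with hypothetical names:

```python
# Hypothetical model of the upload window: uploads happen only for sizes
# strictly above the soft barrier and at most the hard barrier.
def between_barriers(size_kb, soft_kb, hard_kb):
    return soft_kb < size_kb <= hard_kb

# With the workaround (soft == hard == 3072 KB for tasks), no payload size
# can fall inside the window, so the null-storage upload path is never hit.
assert not any(between_barriers(s, 3072, 3072) for s in range(0, 20000))
# With the defaults, the 5120 KB repro payload does fall inside the window.
assert between_barriers(5120, 3072, 10240)
```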

@saksham2105
Contributor

saksham2105 commented Sep 28, 2023

This commit should fix the issue; it should work now. Let us know if it doesn't work for you.

@anjkl
Author

anjkl commented Sep 29, 2023

Thanks a lot for the fix.

Just a note: the fix may lead to unexpected behavior when running several Conductor instances with DummyPayloadStorage, so it shouldn't be used in production. But that is probably acceptable for a dummy storage implementation.

@anjkl anjkl closed this as completed Sep 29, 2023
@saksham2105
Contributor

Yeah, we have a test-env property that creates a conditional bean for DummyPayloadStorage; for production we have different storage classes like S3, etc.
