Performance of the Netflix Conductor(In Memory) #3034

npk1994 · 2022-06-07T14:34:04Z

npk1994
Jun 7, 2022

Hi Team, we have deployed the Conductor server in a Kubernetes setup and configured a sample workflow with two HTTP tasks. (Both tasks hit a service deployed in the same environment.) We are not using a persistence layer (everything is in memory), with Conductor 3.8.0.
Below is the configuration of our workflow:

{
    "name": "sample-wf",
    "description": "get the country details with code",
    "version": 2,
    "schemaVersion": 2,
    "ownerEmail": "",
    "tasks": [
        {
            "name": "sample-getcall-1",
            "taskReferenceName": "sample-task-post-1",
            "type": "HTTP",
            "inputParameters": {
                "http_request": {
                    "uri": "http://sample-task1:8080/sample-task1/v1/api/task1/task1",
                    "method": "GET"
                }
            }
        },
        {
            "name": "sample-post-call-1",
            "taskReferenceName": "sample-task-post-2",
            "type": "HTTP",
            "inputParameters": {
                "http_request": {
                    "uri": "http://sample-task1:8080/sample-task1/v1/api/task1/task2",
                    "method": "POST",
                    "body": "${sample-task-post-1.output.response.body}"
                }
            }
        }
    ]
}

The overall workflow execution takes 591 ms. The first and second tasks took 107 ms and 86 ms respectively, so Conductor itself accounts for roughly 400 ms. The time gap between the end of task 1 and the start of task 2 is around 200 ms.
I have attached the server log file, starting from the beginning of the workflow.
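
For reference, the breakdown above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: it assumes that everything not spent inside the two HTTP tasks is attributable to Conductor (queueing, polling, state updates), which the log would need to confirm.

```python
# Timings reported in the post above, in milliseconds.
total_ms = 591
task_ms = [107, 86]

# Assumption of this sketch: time not spent inside the HTTP tasks
# is attributed to Conductor (queueing, polling, state updates).
overhead_ms = total_ms - sum(task_ms)
overhead_share = overhead_ms / total_ms

print(f"overhead: {overhead_ms} ms ({overhead_share:.0%} of the run)")
```

With the numbers from the post this gives 398 ms, about two thirds of the total run.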

Is this delay between tasks expected? Is there any way to bring it down? We would like to hear your suggestions.

server-wf-execution-log-timestmp.txt

npk1994 · 2022-06-07T14:49:33Z

npk1994
Jun 7, 2022
Author

Adding to this:
While we dig into what is causing this, we would like to understand what timings the community sees in their own implementations.
We are trying to use Conductor for a use case where a set of 7-8 microservices is invoked (mostly sequentially). All these services put together should complete in around 300 ms. We cannot accept more than 20-30 ms on top of this as additional processing time from Conductor. But from what we are currently seeing, Conductor adds more time than the actual services themselves. For this exercise we are not using persistence; it is all in memory. When we added persistence, the overall time increased by a further 100 ms.
Adding a screenshot of the configurations and the execution.

0 replies

boney9 · 2022-06-07T19:51:24Z

boney9
Jun 7, 2022

Hi @npk1994 - We would love to help. This might require analysis of your configuration. Technically, Conductor can schedule and run the next task in under a millisecond. Here is a screen grab of the same flow running in one of our test environments.

Some of the settings you can change are the polling intervals for tasks. You can also increase the number of system workers handling the HTTP tasks.

1 reply

krishnapyde Jun 9, 2022

Is there a list of tuning parameters, a reference guide, or something similar that we can look up?
When you say "increase the number of system workers handling the HTTP tasks", is this a parameter setting? Can you tell us which parameter you are referring to?

npk1994 · 2022-06-08T07:39:55Z

npk1994
Jun 8, 2022
Author

Hi @boney9, we are using the default out-of-the-box configuration as of now. We would like to understand which configuration settings can be added to reduce this delay.
Thank you!

0 replies

krishnapyde · 2022-06-09T10:08:38Z

krishnapyde
Jun 9, 2022

Consider this use case.
Service A calls Service B which then calls Service C.
Assuming each service takes 100ms, the total time to complete the flow will be 100*3 = 300ms + Network latency (~10ms).

But when we set this up as a workflow in Conductor, the total time to complete the flow is around 1 second. A lot of time is lost inside Conductor.
Our use case demands that the flow be processed in around 300 ms, and we will have 10,000 workflow executions per second.
There are multiple threads on the Discussion board about time lost inside Conductor, but none of them reference a comprehensive list of tuning parameters or a tuning guide. We are unsure how to proceed here.

I am unsure whether Conductor is suited to real-time applications (for use cases like the one above). I may be wrong, but at this point I am unable to proceed with Conductor.

If there are any references on implementation patterns or do's and don'ts for building a workflow and setting up Conductor with HTTP tasks so that Conductor adds the minimum possible extra time (no more than 5%-10% over the time taken by the HTTP tasks themselves), please share them.
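
To put the numbers in this post side by side, here is a small sketch of the budget arithmetic. The helper name `overhead_budget_ms` is made up for illustration; the figures (three 100 ms calls, ~10 ms network latency, a 5%-10% allowance, and roughly 1 second observed) are the ones quoted above.

```python
def overhead_budget_ms(service_ms, low_pct=5, high_pct=10):
    """Return the (min, max) extra time allowed on top of the service time."""
    return service_ms * low_pct / 100, service_ms * high_pct / 100

service_ms = 3 * 100 + 10   # three 100 ms calls plus ~10 ms network latency
observed_ms = 1000          # "around 1 second", as reported above

budget_low, budget_high = overhead_budget_ms(service_ms)
actual_overhead_ms = observed_ms - service_ms

print(f"allowed overhead: {budget_low:.1f}-{budget_high:.1f} ms")
print(f"observed overhead: {actual_overhead_ms} ms")
```

Under these assumptions the allowance is 15.5-31 ms, while the observed overhead is about 690 ms.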

2 replies

manan164 Jun 9, 2022

Hi @krishnapyde, Conductor is well suited to real-time orchestration use cases. We are testing millions of workflows per day in our test environment. Please go through these properties.
Basically, what you are facing is either a system task worker polling interval that is too long or a system task worker thread count that is too low. Happy to hop on a call with you and understand the requirement.

krishnapyde Jun 10, 2022

Thanks, @manan164 , let me scan through the list and revert.

aravindanr · 2022-06-10T21:36:34Z

aravindanr
Jun 10, 2022
Maintainer

@npk1994 @krishnapyde In your workflow, both sample-task-post-1 and sample-task-post-2 are HTTP tasks. HTTP tasks are placed in a queue and there is a worker SystemTaskWorker which polls and executes those tasks. I recommend that you check the task_queue_wait metric. If the tasks are in the queue way longer than 50 milliseconds (default poll interval), consider increasing the value of conductor.app.system-task-worker-thread-count property.

You can also reduce the poll interval further using conductor.app.system-task-worker-poll-interval. Note that reducing the poll interval could adversely affect CPU usage.
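
As a rough illustration of why the poll interval matters, the sketch below simulates how long a queued task waits for the next poll. It is a simplified model, not Conductor code: it assumes a fixed-rate poller, a free worker thread at every poll, and task arrivals spread uniformly across the interval. The function name `simulate_queue_wait` is hypothetical.

```python
import random

def simulate_queue_wait(poll_interval_ms, samples=100_000, seed=0):
    """Average and worst wait before the next poll picks up a queued task.

    Assumes a fixed-rate poller, a free worker thread at every poll, and a
    task arrival time uniformly distributed within the poll interval.
    """
    rng = random.Random(seed)
    waits = [poll_interval_ms - rng.uniform(0, poll_interval_ms)
             for _ in range(samples)]
    return sum(waits) / samples, max(waits)

for interval in (50, 10, 1):
    mean_wait, worst_wait = simulate_queue_wait(interval)
    print(f"poll every {interval:>2} ms: mean wait ~{mean_wait:.1f} ms, "
          f"worst ~{worst_wait:.1f} ms")
```

In this model the average wait is about half the poll interval and the worst case is one full interval, per task. If the task_queue_wait metric is well above that, the model's "free worker" assumption is probably not holding and the thread count is the more likely bottleneck.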

5 replies

npk1994 Jun 29, 2022
Author

@manan164 @aravindanr, we tried the property changes below. They are not helping much.

conductor.app.systemTaskWorkerPollInterval=1ms
conductor.app.systemTaskWorkerThreadCount=16

Below is our resource configuration on the Kubernetes pods:

Memory:
      Request: 4Gi
      Limit: 4Gi
CPU:
      Request: 4
      Limit: 4

dangjianguo123 Aug 4, 2022

Hello, has this problem been solved? I am seeing the same issue.

Sunyelw Aug 9, 2022

I also have this type of problem, and in my case it is a simple task.
See issue 3162.

npk1994 Aug 9, 2022
Author

Hi @dangjianguo123, not yet. But the Orkes team (@manan164) is preparing a new release in 3-4 weeks that targets real-time use cases, which should address these issues.

lianjunwei Aug 27, 2022

Is your problem solved? I also have this type of problem, and in my case it is a simple task.
See #3206.

flavioschuindt · 2022-12-15T02:28:30Z

flavioschuindt
Dec 15, 2022

I would like to shed some light on this discussion. I am facing the same issue and I believe I have found something that might explain the root cause, but I would like to discuss it here and confirm it. I am using Conductor 2.30.3, but looking at the code, this should happen regardless of the version.

I created a simple workflow very similar to the one that @npk1994 created. Only difference is that mine has 6 http tasks in sequence while the one from @npk1994 has only two. Here is my workflow:

{
  "createTime": 1670456718043,
  "name": "test-parent-flat",
  "description": "test-parent-flat",
  "version": 1,
  "tasks": [
    {
      "name": "test-parent-flat-task1",
      "taskReferenceName": "test-parent-flat-task1",
      "inputParameters": {
        "http_request": {
          "uri": "http://localhost:8443/api/health",
          "method": "GET",
          "accept": "application/json"
        }
      },
      "type": "HTTP",
      "decisionCases": {},
      "defaultCase": [],
      "forkTasks": [],
      "startDelay": 0,
      "joinOn": [],
      "optional": false,
      "taskDefinition": {
        "name": "test-parent-flat-task1",
        "retryCount": 0,
        "timeoutSeconds": 1200,
        "inputKeys": [],
        "outputKeys": [],
        "timeoutPolicy": "TIME_OUT_WF",
        "retryLogic": "FIXED",
        "retryDelaySeconds": 5,
        "responseTimeoutSeconds": 1200,
        "inputTemplate": {},
        "rateLimitPerFrequency": 0,
        "rateLimitFrequencyInSeconds": 1
      },
      "defaultExclusiveJoinTask": [],
      "asyncComplete": false,
      "loopOver": []
    },
    {
      "name": "test-parent-flat-task2",
      "taskReferenceName": "test-parent-flat-task2",
      "inputParameters": {
        "http_request": {
          "uri": "http://localhost:8443/api/health",
          "method": "GET",
          "accept": "application/json"
        }
      },
      "type": "HTTP",
      "decisionCases": {},
      "defaultCase": [],
      "forkTasks": [],
      "startDelay": 0,
      "joinOn": [],
      "optional": false,
      "taskDefinition": {
        "name": "test-parent-flat-task2",
        "retryCount": 0,
        "timeoutSeconds": 1200,
        "inputKeys": [],
        "outputKeys": [],
        "timeoutPolicy": "TIME_OUT_WF",
        "retryLogic": "FIXED",
        "retryDelaySeconds": 5,
        "responseTimeoutSeconds": 1200,
        "inputTemplate": {},
        "rateLimitPerFrequency": 0,
        "rateLimitFrequencyInSeconds": 1
      },
      "defaultExclusiveJoinTask": [],
      "asyncComplete": false,
      "loopOver": []
    },
    {
      "name": "test-parent-flat-task3",
      "taskReferenceName": "test-parent-flat-task3",
      "inputParameters": {
        "http_request": {
          "uri": "http://localhost:8443/api/health",
          "method": "GET",
          "accept": "application/json"
        }
      },
      "type": "HTTP",
      "decisionCases": {},
      "defaultCase": [],
      "forkTasks": [],
      "startDelay": 0,
      "joinOn": [],
      "optional": false,
      "taskDefinition": {
        "name": "test-parent-flat-task3",
        "retryCount": 0,
        "timeoutSeconds": 1200,
        "inputKeys": [],
        "outputKeys": [],
        "timeoutPolicy": "TIME_OUT_WF",
        "retryLogic": "FIXED",
        "retryDelaySeconds": 5,
        "responseTimeoutSeconds": 1200,
        "inputTemplate": {},
        "rateLimitPerFrequency": 0,
        "rateLimitFrequencyInSeconds": 1
      },
      "defaultExclusiveJoinTask": [],
      "asyncComplete": false,
      "loopOver": []
    },
    {
      "name": "test-parent-flat-task4",
      "taskReferenceName": "test-parent-flat-task4",
      "inputParameters": {
        "http_request": {
          "uri": "http://localhost:8443/api/health",
          "method": "GET",
          "accept": "application/json"
        }
      },
      "type": "HTTP",
      "decisionCases": {},
      "defaultCase": [],
      "forkTasks": [],
      "startDelay": 0,
      "joinOn": [],
      "optional": false,
      "taskDefinition": {
        "name": "test-parent-flat-task4",
        "retryCount": 0,
        "timeoutSeconds": 1200,
        "inputKeys": [],
        "outputKeys": [],
        "timeoutPolicy": "TIME_OUT_WF",
        "retryLogic": "FIXED",
        "retryDelaySeconds": 5,
        "responseTimeoutSeconds": 1200,
        "inputTemplate": {},
        "rateLimitPerFrequency": 0,
        "rateLimitFrequencyInSeconds": 1
      },
      "defaultExclusiveJoinTask": [],
      "asyncComplete": false,
      "loopOver": []
    },
    {
      "name": "test-parent-flat-task5",
      "taskReferenceName": "test-parent-flat-task5",
      "inputParameters": {
        "http_request": {
          "uri": "http://localhost:8443/api/health",
          "method": "GET",
          "accept": "application/json"
        }
      },
      "type": "HTTP",
      "decisionCases": {},
      "defaultCase": [],
      "forkTasks": [],
      "startDelay": 0,
      "joinOn": [],
      "optional": false,
      "taskDefinition": {
        "name": "test-parent-flat-task5",
        "retryCount": 0,
        "timeoutSeconds": 1200,
        "inputKeys": [],
        "outputKeys": [],
        "timeoutPolicy": "TIME_OUT_WF",
        "retryLogic": "FIXED",
        "retryDelaySeconds": 5,
        "responseTimeoutSeconds": 1200,
        "inputTemplate": {},
        "rateLimitPerFrequency": 0,
        "rateLimitFrequencyInSeconds": 1
      },
      "defaultExclusiveJoinTask": [],
      "asyncComplete": false,
      "loopOver": []
    },
    {
      "name": "test-parent-flat-task6",
      "taskReferenceName": "test-parent-flat-task6",
      "inputParameters": {
        "http_request": {
          "uri": "http://localhost:8443/api/health",
          "method": "GET",
          "accept": "application/json"
        }
      },
      "type": "HTTP",
      "decisionCases": {},
      "defaultCase": [],
      "forkTasks": [],
      "startDelay": 0,
      "joinOn": [],
      "optional": false,
      "taskDefinition": {
        "name": "test-parent-flat-task6",
        "retryCount": 0,
        "timeoutSeconds": 1200,
        "inputKeys": [],
        "outputKeys": [],
        "timeoutPolicy": "TIME_OUT_WF",
        "retryLogic": "FIXED",
        "retryDelaySeconds": 5,
        "responseTimeoutSeconds": 1200,
        "inputTemplate": {},
        "rateLimitPerFrequency": 0,
        "rateLimitFrequencyInSeconds": 1
      },
      "defaultExclusiveJoinTask": [],
      "asyncComplete": false,
      "loopOver": []
    }
  ],
  "inputParameters": [],
  "outputParameters": {},
  "schemaVersion": 2,
  "restartable": true,
  "workflowStatusListenerEnabled": true,
  "ownerEmail": "[email protected]",
  "timeoutPolicy": "ALERT_ONLY",
  "timeoutSeconds": 0,
  "variables": {}
}

This workflow has six repeated tasks that call the health check endpoint of Conductor itself, i.e. localhost. Each call shouldn't take more than 100 ms (being very extreme here!), which in the very worst case would lead to a total workflow execution time of 600 ms. I have executed this multiple times and I am seeing times like 2 seconds or 2.5 seconds, which is unacceptable.

From the Conductor code, here is the sequence that happens:

The SystemTaskCoordinator class starts an instance of SystemTask for every system task that Conductor internally considers async (e.g. HTTP, Subworkflow).
An instance of the SystemTaskWorker class then periodically (at the interval defined by systemTaskWorkerPollInterval) polls and executes the task. It fetches from the queue and assigns the task to an available worker of the AsyncTaskExecutor.

With this basic understanding, I enabled debug logs in Conductor and observed that the tasks are being scheduled, but there is a considerable delay before each task's start time. As an example, look at this one:

This is 230 ms just to start executing the task. The cause is a static 200 ms blocking sleep call in the SystemTaskWorker. This 200 ms value ultimately flows into the particular queue implementation (I am using Postgres); inside that implementation, if you have not received enough messages from the queue and you are still under 200 ms, the code blocks. Not coincidentally, this is the same 200 ms reported above by @npk1994. I believe this is in the Conductor code to reduce the "penalty" of an empty poll: if you go to the queue and nothing is there, Conductor waits a little longer so that a task can arrive. This reduces wasted CPU cycles.

I created a sequence diagram (based on Conductor 2.X) to try to explain it:

See what happens with Task 2 (T2): it is scheduled just 3 ms after the start of polling cycle 3 (P3), and the cycle is now blocked by the code mentioned above. It will only unblock after 200 ms (Conductor 2.X) or 100 ms (Conductor 3.X). Only once it is unblocked will T2 be fetched from the queue. This delays T2.
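
The effect on T2 can be sketched as a tiny model. This is not Conductor code; it only encodes the behavior as described above, under the assumption that a poll cycle which finds too few messages blocks for the full window and that a task arriving in the meantime is fetched only afterwards. The function name `pickup_delay_ms` is made up for illustration.

```python
def pickup_delay_ms(arrival_ms, cycle_start_ms, block_ms):
    """Delay before a task is fetched when the poll call blocks for block_ms.

    Simplified model of the behavior described above: a poll cycle that
    started at cycle_start_ms finds too few messages and blocks for the
    full block_ms; a task arriving meanwhile is only fetched afterwards.
    """
    if not cycle_start_ms <= arrival_ms <= cycle_start_ms + block_ms:
        raise ValueError("arrival must fall inside the blocking window")
    return cycle_start_ms + block_ms - arrival_ms

# T2 is scheduled 3 ms after the polling cycle starts.
for label, block_ms in (("Conductor 2.x", 200), ("Conductor 3.x", 100)):
    delay = pickup_delay_ms(arrival_ms=3, cycle_start_ms=0, block_ms=block_ms)
    print(f"{label}: T2 waits {delay} ms before it is fetched")
```

Under this model T2 waits 197 ms with a 200 ms block and 97 ms with a 100 ms block, and a similar delay can repeat for each sequential task.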

Note that there are properties for system task workers that can be configured. I have already tried:

Increasing workflow.system.task.worker.thread.count (I don't think this would help here, because the problem is not a lack of resources to execute tasks, but I tried anyway).
Increasing workflow.system.task.worker.poll.interval (the idea was to have fewer poll cycles. The problem is that the more you increase this, the fewer cycles you have, but the intervals between them grow, which ultimately delays the polling of the task).

Does the above make sense? If yes, how can this be avoided? It is impacting important use cases and any help is appreciated.

0 replies

mig82 · 2023-08-26T12:32:18Z

mig82
Aug 26, 2023

Hi. We're also considering Conductor for a real-time use case. Has this issue been solved?
From the thread above it first seemed like it was a non-issue, fixable by polling and thread count configuration.
Then there's a comment from a year ago (by @npk1994 mentioning @manan164 ) saying the Orkes team is working on a fix for real-time use cases.

0 replies
