Distributed locking for conductor #2600

flavioschuindt · 2021-11-25T07:28:02Z

Hi, guys.

Sometimes I face "weird" problems with Conductor, especially when I am stressing it a lot (thousands of workflow executions in parallel): workflows stuck with a task in IN_PROGRESS even though the task was already completed, workflows terminating abruptly, etc. I can't say for sure, but a quick search of the open issues shows many of them leading to questions like "What distributed locking are you using?". I would understand this kind of situation if I had two Conductor instances running in parallel, but I have just one. In any case, I would like to understand:

What is distributed locking for in the case of just one Conductor instance? Does it really make sense?
Do you have an end-to-end getting-started guide for setting up distributed locking in Conductor? What are the steps? What should I do? How do I test it properly and make sure it is working?

I could only find this section in the docs, but it is very high level. I would like some guidance here if possible. If it matters, my current configuration is a single Conductor instance with Postgres as the database. No Redis and no Dynomite.

Thank you in advance!

Answered by apanicker-nflx

Dec 15, 2021

apanicker-nflx · 2021-11-29T23:05:13Z

Maintainer

@flavioschuindt The problems you describe could be caused by concurrency issues. The distributed locking interface was introduced to Conductor to prevent exactly these.

What is distributed locking for in the case of just one Conductor instance? Does it really make sense?

In this case, in the absence of a distributed lock, two or more separate threads on the same instance could be evaluating the state of a given workflow at the same time.
The different actors in this case could be:

updateTask API calls for parallel tasks in a given workflow from multiple workers reaching the server simultaneously.
WorkflowSweeper evaluating a given workflow at the same time as an updateTask for any task within that workflow.

In either of these cases, the threads race to update the workflow state in the persistence layer. This can leave the workflow in an inconsistent state from which it cannot recover automatically.
As part of the recent changes with the 3.0 upgrade, a new background service named WorkflowRepairService was introduced to aid with auto-repair in a few of these cases. But this service alone cannot provide complete resiliency, and we highly recommend using a distributed lock.
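To make the single-instance race concrete, here is a minimal Java sketch. It is not Conductor code: WorkflowState, PerWorkflowLockSketch, and the id "wf-1" are hypothetical names, and an in-process ReentrantLock keyed by workflow id stands in for whatever lock backend is configured. Several threads perform an unsynchronized read-modify-write on the same record; with the lock, no update is lost.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a stored workflow record; not a Conductor class.
class WorkflowState {
    int completedTasks; // plain int on purpose: each update is a read-modify-write
}

class PerWorkflowLockSketch {
    // One in-process lock per workflow id, created on first use.
    private static final ConcurrentHashMap<String, ReentrantLock> LOCKS = new ConcurrentHashMap<>();

    // Runs `threads` threads that each update the same workflow `updatesPerThread` times
    // and returns the final counter value.
    static int runUpdates(int threads, int updatesPerThread, boolean useLock) {
        WorkflowState state = new WorkflowState();
        String workflowId = "wf-1"; // hypothetical id
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < updatesPerThread; i++) {
                    if (useLock) {
                        ReentrantLock lock = LOCKS.computeIfAbsent(workflowId, id -> new ReentrantLock());
                        lock.lock();
                        try {
                            state.completedTasks++; // serialized per workflow id
                        } finally {
                            lock.unlock();
                        }
                    } else {
                        state.completedTasks++; // unsynchronized: updates can be lost
                    }
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join(); // join also makes the workers' writes visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return state.completedTasks;
    }

    public static void main(String[] args) {
        System.out.println("with lock:    " + runUpdates(8, 10_000, true));
        System.out.println("without lock: " + runUpdates(8, 10_000, false));
    }
}
```

With the lock the result is always threads * updatesPerThread. Without it the total may come out lower on some runs, which is the same kind of lost update that can leave a workflow inconsistent.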

Do you have an end-to-end getting-started guide for setting up distributed locking in Conductor? What are the steps? What should I do? How do I test it properly and make sure it is working?

There are two options available for this: redis-lock or zookeeper-lock. The specific steps to set up either of these systems should be available on the respective product pages. The integration with Conductor is completely configuration-driven: you set the properties for Redis or ZooKeeper. Additionally, you need to enable locking using the property conductor.app.workflowExecutionLockEnabled.
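On the "how to test it" part, one generic approach is a small mutual-exclusion check that can be pointed at any lock implementation. The Java sketch below is only an illustration: SimpleLock is a hypothetical minimal contract (not Conductor's lock interface), and InMemoryLock is an in-process stand-in for a Redis- or ZooKeeper-backed lock. The check counts how many threads are inside the critical section for one lock id at the same time; a working lock never lets that exceed 1.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical minimal lock contract for this sketch; not Conductor's interface.
interface SimpleLock {
    void acquire(String lockId);
    void release(String lockId);
}

// In-memory stand-in for a Redis- or ZooKeeper-backed lock.
class InMemoryLock implements SimpleLock {
    private final ConcurrentHashMap<String, Semaphore> permits = new ConcurrentHashMap<>();

    public void acquire(String lockId) {
        permits.computeIfAbsent(lockId, id -> new Semaphore(1)).acquireUninterruptibly();
    }

    public void release(String lockId) {
        // Assumes acquire(lockId) was called first by the same caller.
        permits.get(lockId).release();
    }
}

class MutualExclusionCheck {
    // Returns the largest number of threads ever seen holding the lock for one id.
    // A result of 1 means mutual exclusion held for the whole run.
    static int maxConcurrentHolders(SimpleLock lock, int threads, int rounds) {
        AtomicInteger inside = new AtomicInteger();
        AtomicInteger maxSeen = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < rounds; i++) {
                    lock.acquire("wf-1"); // hypothetical workflow id
                    try {
                        int now = inside.incrementAndGet();
                        maxSeen.accumulateAndGet(now, Math::max);
                        Thread.yield(); // widen the window for overlap if the lock is broken
                    } finally {
                        inside.decrementAndGet();
                        lock.release("wf-1");
                    }
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return maxSeen.get();
    }

    public static void main(String[] args) {
        System.out.println("max concurrent holders: " + maxConcurrentHolders(new InMemoryLock(), 8, 1_000));
    }
}
```

Swapping InMemoryLock for an adapter around a real lock client would exercise the actual backend. That adapter is left out here because its API depends on the client library you use.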

flavioschuindt · 2021-11-29T23:55:25Z

Author

Hey @apanicker-nflx, very good intro. Thanks for that. So, is there any recommendation from conductor between redis-lock and zookeeper-lock? What are the pros and cons of each one specifically in the context of conductor? I will start to explore those, but if you already have any insight, it would be good to share it.

By the way, when you say "there could be two or more separate threads", are you referring to a single worker instance, but multiple threads configured due to the parameter .withThreadCount(threadCount) in the WorkflowTaskCoordinator?

apanicker-nflx · 2021-12-15T22:27:28Z

Maintainer

is there any recommendation from conductor between redis-lock and zookeeper-lock? What are the pros and cons of each one specifically in the context of conductor?

In the context of Conductor, if you are already using redis-persistence, it would be easier for you to set up redis-lock. Other than that, both implementations work well functionally as per our testing. We noticed that the Redis implementation was more performant at higher loads; however, the difference between the two implementations is not significant.

when you say "there could be two or more separate threads", are you referring to a single worker instance, but multiple threads configured due to the parameter .withThreadCount(threadCount) in the WorkflowTaskCoordinator?

No, that would be on the client. What I am referring to is:

The workflow evaluation within the WorkflowSweeper (background server thread)
An updateTask request for a task within a given workflow which will be evaluated for workflow progression (in the request thread)
If there are parallel forks, this updateTask could trigger workflow evaluations on multiple threads based on update requests from multiple workers
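The interaction between the sweeper and a request thread can be sketched as follows. This Java example is an illustration only, not Conductor's implementation: the class name, the event strings, the 50 ms wait, and the policy that the sweeper skips a pass when the lock is busy are all assumptions made for the sketch. A request thread holds the lock for one workflow while a background thread tries to acquire the same lock with a timeout.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class SweeperVsUpdateSketch {
    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Returns the ordered list of events recorded by the two threads.
    static List<String> run() {
        ReentrantLock workflowLock = new ReentrantLock(); // lock for one hypothetical workflow
        List<String> events = Collections.synchronizedList(new ArrayList<>());
        CountDownLatch updateHoldsLock = new CountDownLatch(1);
        CountDownLatch sweeperDone = new CountDownLatch(1);

        // Request thread: an updateTask call that evaluates the workflow under the lock.
        Thread requestThread = new Thread(() -> {
            workflowLock.lock();
            try {
                events.add("updateTask: evaluating workflow");
                updateHoldsLock.countDown();
                awaitQuietly(sweeperDone); // keep holding the lock until the sweeper has tried
            } finally {
                workflowLock.unlock();
            }
        });

        // Background thread: tries the same lock with a timeout and skips if it is busy.
        Thread sweeperThread = new Thread(() -> {
            awaitQuietly(updateHoldsLock);
            try {
                if (workflowLock.tryLock(50, TimeUnit.MILLISECONDS)) {
                    try {
                        events.add("sweeper: evaluating workflow");
                    } finally {
                        workflowLock.unlock();
                    }
                } else {
                    events.add("sweeper: lock busy, skipping this pass");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                sweeperDone.countDown();
            }
        });

        requestThread.start();
        sweeperThread.start();
        try {
            requestThread.join();
            sweeperThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new ArrayList<>(events);
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

The latches force the overlap so the outcome is repeatable: the two threads never evaluate the workflow at the same time, and the background pass can simply be retried later.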

1 reply

flavioschuindt Dec 15, 2021
Author

Thanks a lot, @apanicker-nflx. This helps and makes sense. Regarding the redis-lock module, I believe I found an issue, which I have described in #2642. Could you please take a look and add your thoughts?
