[BUG] When kube-ovn-controller restarts, need takes a long time to start process queue tasks #4849

Open
cmdy opened this issue Dec 18, 2024 · 5 comments · May be fixed by #4895
Labels
bug (Something isn't working), performance (Anything that can make Kube-OVN faster), subnet

Comments

@cmdy
Contributor

cmdy commented Dec 18, 2024

Kube-OVN Version

v1.12.29

Kubernetes Version

v1.28.11

Operation-system/Kernel Version

"CentOS Linux 7 (Core)" 5.10.0-228.2410.el7.bzl.x86_64

Description

  1. When kube-ovn-controller restarts, it takes a long time (1 hour+) to start processing queue tasks in a large cluster (2000+ Nodes, 150,000+ Pods, 100+ subnets).
  2. Work queue depth monitor: [image]

Steps To Reproduce

  1. Create a large cluster (2000+ Nodes, 150,000+ Pods, 100+ subnets).
  2. Restart the kube-ovn-controller pod.

Current Behavior

It takes a long time (1 hour+) before the controller starts processing queue tasks.

Expected Behavior

The controller should start processing queue tasks quickly after a restart.

@cmdy cmdy added the bug Something isn't working label Dec 18, 2024
@dosubot dosubot bot added performance Anything that can make Kube-OVN faster subnet labels Dec 18, 2024

dosubot bot commented Dec 18, 2024

The issue you're experiencing with the kube-ovn-controller taking a long time to start processing queue tasks after a restart in a large cluster is likely due to two main factors:

  1. Slow Policy Route Addition: Each policy route addition operation is taking 1-3 seconds, which significantly delays the processing of the entire queue.

  2. Duplicate Operations: There are duplicate operations for the same node and subnet combinations, resulting in more operations than expected, which further slows down the process.

These factors contribute to high work queue latency and slow queue depth reduction, causing the processing to take several hours instead of minutes [1].
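
To illustrate the duplicate-operation point, here is a minimal Go sketch of removing repeated node and subnet combinations before any work is issued. The `routeOp` type and `dedupeOps` function are hypothetical names used only for this example; they are not taken from the Kube-OVN code base.

```go
package main

import "fmt"

// routeOp is a hypothetical record of one policy-route addition,
// identified by the node and subnet it applies to.
type routeOp struct {
	Node   string
	Subnet string
}

// dedupeOps returns ops with repeated (node, subnet) pairs removed,
// keeping the first occurrence of each pair and preserving order.
func dedupeOps(ops []routeOp) []routeOp {
	seen := make(map[routeOp]struct{}, len(ops))
	unique := make([]routeOp, 0, len(ops))
	for _, op := range ops {
		if _, ok := seen[op]; ok {
			continue // this pair was already scheduled
		}
		seen[op] = struct{}{}
		unique = append(unique, op)
	}
	return unique
}

func main() {
	ops := []routeOp{
		{"node-1", "subnet-a"},
		{"node-1", "subnet-a"}, // duplicate
		{"node-2", "subnet-a"},
		{"node-1", "subnet-b"},
		{"node-2", "subnet-a"}, // duplicate
	}
	unique := dedupeOps(ops)
	fmt.Printf("%d operations reduced to %d\n", len(ops), len(unique))
}
```

If each operation costs 1-3 seconds, every duplicate removed this way saves that much time on startup.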

To improve the processing speed, you can increase the WorkerNum parameter using the --worker-num flag, which controls the parallelism of each worker. By default, it is set to 3, but increasing this number can help process more tasks concurrently in a large cluster [2].
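
As a rough picture of what more worker parallelism means, the Go sketch below drains a queue with a configurable number of goroutines. It is a generic worker pool under stated assumptions, not the controller's actual implementation: `drainQueue` and its `handle` callback are hypothetical names, and the sketch only applies once workers have actually begun consuming the queue.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// drainQueue processes every key using workerNum goroutines and returns
// how many keys were handled. handle stands in for the real per-item
// logic. workerNum must be at least 1, otherwise nothing reads the queue.
func drainQueue(keys []string, workerNum int, handle func(string)) int64 {
	queue := make(chan string)
	var processed int64
	var wg sync.WaitGroup

	for i := 0; i < workerNum; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for key := range queue {
				handle(key)
				atomic.AddInt64(&processed, 1)
			}
		}()
	}

	for _, key := range keys {
		queue <- key
	}
	close(queue) // lets the workers exit once the queue is empty
	wg.Wait()
	return processed
}

func main() {
	keys := []string{"node-1", "node-2", "node-3", "node-4", "node-5", "node-6"}
	n := drainQueue(keys, 3, func(string) {})
	fmt.Println("processed:", n)
}
```

With slow per-item handlers, total drain time shrinks roughly in proportion to the number of workers, as long as the backend they call is not itself the bottleneck.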


@oilbeater
Collaborator

This is related to #4822, and #4835 removes some duplicate policy additions. Can you check the logs to see what else happened between the kube-ovn-controller restart and the point where the controller starts processing the queued events?

@zsxsoft

zsxsoft commented Dec 19, 2024

As the author of #4822, I experienced a similar issue with ~300 subnets and 10 nodes, where kube-ovn-controller takes over 30 minutes to start.
The trend of the AddSubnet events in my WFD chart is similar to that of your AddNode events. I suspect that your subnets are centralized, as distributed subnets might take even longer to process.

@cmdy
Contributor Author

cmdy commented Dec 19, 2024

> This is related to #4822, and #4835 removes some duplicate policy additions. Can you check the logs to see what else happened between the kube-ovn-controller restart and the point where the controller starts processing the queued events?

This problem is caused by the init operation that runs after the controller starts. During init, initNodeRouter is very slow. I think it is slow because there are too many addPolicyRouteToVpc operations. For this kind of initialization, batch operations should be used to reduce the number of accesses to the northbound database. I will submit a PR later for you to review.
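
A minimal Go sketch of the batching idea, to show why it reduces northbound database accesses. Everything here is hypothetical and simplified: `fakeNB` is a stand-in that only counts transactions, not the real OVN client, and `policyRoute`, `addOneByOne`, and `addBatched` are illustrative names rather than the code in the eventual PR.

```go
package main

import "fmt"

// policyRoute is a hypothetical, simplified policy route entry.
type policyRoute struct {
	Priority int
	Match    string
	Action   string
}

// fakeNB is a stand-in for a northbound database client. It only counts
// committed transactions; it is not the real OVN client.
type fakeNB struct {
	transactions int
	routes       []policyRoute
}

// commit applies all given routes in a single transaction.
func (nb *fakeNB) commit(routes ...policyRoute) {
	nb.transactions++
	nb.routes = append(nb.routes, routes...)
}

// addOneByOne issues one transaction per route.
func addOneByOne(nb *fakeNB, routes []policyRoute) {
	for _, r := range routes {
		nb.commit(r)
	}
}

// addBatched groups routes into chunks of at most batchSize and issues
// one transaction per chunk. batchSize must be greater than zero.
func addBatched(nb *fakeNB, routes []policyRoute, batchSize int) {
	for start := 0; start < len(routes); start += batchSize {
		end := start + batchSize
		if end > len(routes) {
			end = len(routes)
		}
		nb.commit(routes[start:end]...)
	}
}

func main() {
	routes := make([]policyRoute, 1000)
	for i := range routes {
		// Placeholder match strings, not real OVN match expressions.
		routes[i] = policyRoute{Priority: 100, Match: fmt.Sprintf("match-%d", i), Action: "reroute"}
	}

	single, batched := &fakeNB{}, &fakeNB{}
	addOneByOne(single, routes)
	addBatched(batched, routes, 200)
	fmt.Printf("one-by-one: %d transactions, batched: %d transactions\n",
		single.transactions, batched.transactions)
}
```

Both paths write the same routes; only the number of round trips to the database differs, which is where the per-operation latency adds up during init.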

@cmdy
Contributor Author

cmdy commented Dec 30, 2024

The PR is ready for review. Please take a look, @oilbeater:
#4884
