Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: improve celery performance #1165

Open
wants to merge 1 commit into
base: release
Choose a base branch
from

Conversation

Ian2012
Copy link
Contributor

@Ian2012 Ian2012 commented Nov 22, 2024

Description

This removes the gossip and mingle features to improve Celery's performance and reduce the number of messages sent to Redis. It also increases the heartbeat interval to 60s from 2s.

  • without-gossip: Based on the official celery docs, workers are subscribing to worker-related events like heartbeats, and can detect if any worker goes offline, Celery by default does nothing with this information, but this opens the door for task rerouting or restarting workers when they crash (not implemented). In the context of Open edX, there isn't any implementation that depends on this feature, and the Celery changelog only shows fixes for this flag, also this scales to n^2 increasing dramatically network traffic which isn't needed.
  • without-mingle: It syncs the revoked tasks when a worker in the same cluster starts or restarts, avoiding rerunning revoked tasks. I don't have strong arguments against this flag, however, it's highly recommended by different operators to remove it.

For the heartbeat, maybe 20s is a better default in case of network issues or Redis downtimes. However, keep in mind that this heartbeat does nothing. If you disable the heartbeat the workers still will detect if the broker goes down and start the reconnect process, so we should consider removing it completely.

References:

https://www.cloudamqp.com/blog/python-celery-and-rabbitmq.html
https://ankurdhuriya.medium.com/understanding-celery-workers-concurrency-prefetching-and-heartbeats-85707f28c506

@Ian2012
Copy link
Contributor Author

Ian2012 commented Nov 22, 2024

Other setting we should take into account:

@Ian2012 Ian2012 marked this pull request as ready for review November 22, 2024 19:50
@regisb
Copy link
Contributor

regisb commented Nov 25, 2024

I agree that we should add --without-mingle and --without-gossip. But it seems that there is a misunderstanding about the heartbeat. If I understand this article correctly (which has more information than the official docs...), the heartbeat is not to detect whether the broker is down, but for the broker to detect when a worker is down. When a worker is down, then its tasks are reassigned to another worker. Surely, we don't want to wait 1 minute until tasks are reassigned. I'm not sure whether we should increase from 2s. If we do, we shouldn't go beyond 5-10s (IMHO).

Concerning -O fair: @igobranco was suggesting here that we add --prefetch-multiplier=1. This would resolve the issue without adding -O fair, right?

@Ian2012
Copy link
Contributor Author

Ian2012 commented Nov 25, 2024

If there is a concern about the heartbeat, we can use the default for now.

Yes, setting the prefetch-multiplier to 1 should resolve that issue too, this setting is what improved the performance on a large instance and helped to not lose tasks when workers are OOM Killed or pods are deleted forcefully, this with ACKS_LATE enabled, however, this would require celery tasks to be idempotent, such as course indexing, grade calculation or reports, however I think that's a large discussion that's out of scope for this PR. See the docs for more information.

I will update the PR with the new defaults.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In review
Development

Successfully merging this pull request may close these issues.

3 participants