fix(celery): stop native runtime before celery daemonization #16050
Conversation
**Performance SLOs**

Comparing candidate `vianney/fix-celery-panic` (accd6b5) with baseline `main` (deb2792).

📈 Performance regressions vs baseline (3 suites, all still within SLO):

| Suite | Benchmark | Time | SLO | vs baseline |
| --- | --- | --- | --- | --- |
| iastaspectsospath (24/24 ✅) | ospathbasename_aspect | 503.667µs | <700.000µs | +21.0% |
| iastaspectssplit (12/12 ✅) | rsplit_aspect | 161.030µs | <250.000µs | +10.3% |
| telemetryaddmetric (30/30 ✅) | 1-count-metric-1-times | 3.423µs | <20.000µs | +12.5% |

🟡 Near SLO breach (1 suite):

| Suite | Benchmark | Time | SLO | vs baseline |
| --- | --- | --- | --- | --- |
| tracer (6/6 ✅) | medium | 3.149ms | <3.200ms | +0.7% |
| tracer (6/6 ✅) | small | 365.021µs | <370.000µs | +3.0% |

All other benchmarks in these suites passed with comfortable margins: times were within a few percent of baseline and well under their SLOs, and memory stayed under its SLO everywhere at roughly +5–6% versus baseline.
**brettlangdon** left a comment:
Are we able to create a regression test for this?
Is it a race condition, or is it generally going to happen every time?
If it is a race condition, I wonder if we can patch `close_open_fds` to add a sleep to it, to sort of give time for the native writer to spin up and make the `/info` call?
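For illustration only, a minimal sketch of that suggested experiment, assuming the function in question is `celery.platforms.close_open_fds` and using an arbitrary one-second delay. This is the reviewer's diagnostic idea, not the fix that was ultimately merged:

```python
# Hypothetical diagnostic patch: delay celery's close_open_fds so the
# native writer has time to spin up and issue its /info request before
# file descriptors are closed. The one-second value is arbitrary.
import time

import celery.platforms

_original_close_open_fds = celery.platforms.close_open_fds


def _delayed_close_open_fds(keep=None):
    time.sleep(1.0)  # grace period for the native runtime to start
    return _original_close_open_fds(keep)


celery.platforms.close_open_fds = _delayed_close_open_fds
```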
Is it fine to start the celery process with `Popen` in the test, or should I find a way to call celery from the Python test?
Yeah, starting a beat subprocess in a test would be totally fine. We just need to figure out the right mechanism for knowing whether it crashed, which could be giving it a task to process or just checking that the app doesn't crash on boot, or whatever.
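A minimal sketch of what such a test could look like, assuming a pytest-style check that launches `celery beat --detach` (so celery actually daemonizes and closes file descriptors) and then scans the beat log file for a Rust panic; the app module name and wait times are placeholders:

```python
# Illustrative regression test: run "celery beat --detach" so celery
# forks and closes open fds, then assert the daemonized beat process
# did not panic. "tests.celery_app" and the sleep duration are
# placeholder values, not the actual test from this PR.
import pathlib
import subprocess
import time


def test_beat_detach_does_not_panic(tmp_path: pathlib.Path) -> None:
    logfile = tmp_path / "beat.log"
    subprocess.run(
        [
            "celery", "-A", "tests.celery_app", "beat",
            "--detach",
            f"--logfile={logfile}",
            f"--pidfile={tmp_path / 'beat.pid'}",
            f"--schedule={tmp_path / 'schedule'}",
        ],
        check=True,
        timeout=30,
    )
    time.sleep(10)  # give the detached beat process time to boot
    log = logfile.read_text() if logfile.exists() else ""
    assert "panicked" not in log, log
```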
Force-pushed from e17b78a to 8fb21a5.
This change is marked for backport to 4.3 and it does not conflict with that branch.
**brettlangdon** left a comment:
some nits, but otherwise lgtm.
This change is marked for backport to 4.3 and it does not conflict with that branch.
## Description

When celery starts beat (celery's scheduler), it creates a new process by forking and then closing all file descriptors. This causes panics in the Rust tokio runtime, because that runtime is created in `after_fork` hooks and manipulates epoll file descriptors (which celery has closed). We fix the issue by shutting down the tokio runtime before `close_open_fds` is called and starting it again afterwards.

### Note

This is done by treating `close_open_fds` as a fork and calling all fork hooks. This is necessary because any tokio runtime would otherwise hit the same issue (e.g. NativeWriter, the [profile manager](DataDog/libdatadog#1464)). Existing fork hooks are unaffected, but future hooks could see unexpected behavior. The ongoing work to share a single tokio runtime across native code should allow us to refactor this so that only that runtime is shut down.

## Testing

Run any celery app with `celery -A celery_app worker --beat`, or just the beat worker with `celery beat --detach`. Without the fix no beat traces are sent; with the fix they are. A regression test has been added to check that the beat process does not panic.

## Risks

None

## Additional Notes

See [this page](https://datadoghq.atlassian.net/wiki/spaces/VP/pages/6035472438/Investigation+Summary+of+GWCP+Crash+with+dd-trace-py+and+Celery+Integration+Issues) for details on the bug.

---------

Co-authored-by: Brett Langdon <[email protected]>
Co-authored-by: Emmett Butler <[email protected]>

(cherry picked from commit 4889a66)

Co-authored-by: vianney <[email protected]>
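As a rough sketch of the approach described above, and not the actual dd-trace-py patch: `close_open_fds` can be wrapped so the tracer's fork hooks run around it, shutting the tokio runtime down before descriptors are closed and restarting it afterwards. `run_before_fork_hooks` and `run_after_fork_hooks` are hypothetical stand-ins for whatever hook-dispatch mechanism the tracer actually uses:

```python
# Sketch: treat celery's close_open_fds like a fork. Run the "before
# fork" hooks (which would stop the native tokio runtime, releasing its
# epoll fds) before fds are closed, then the "after fork in child"
# hooks (which would restart it) once they are closed.
# run_before_fork_hooks / run_after_fork_hooks are hypothetical names.
import functools

import celery.platforms


def run_before_fork_hooks() -> None:
    """Placeholder: stop the native runtime so its fds can be closed safely."""


def run_after_fork_hooks() -> None:
    """Placeholder: restart the native runtime on fresh file descriptors."""


_original_close_open_fds = celery.platforms.close_open_fds


@functools.wraps(_original_close_open_fds)
def _patched_close_open_fds(keep=None):
    run_before_fork_hooks()
    try:
        return _original_close_open_fds(keep)
    finally:
        run_after_fork_hooks()


celery.platforms.close_open_fds = _patched_close_open_fds
```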
…t 4.3] (#16283)

Backport #16050 to 4.3

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: vianney <[email protected]>