Deadlock when using ProcessPoolExecutor with stack v2 profiler #11762

Open
oranav opened this issue Dec 17, 2024 · 8 comments
Labels: Profiling (Continuous Profiling)

Comments

@oranav (Contributor) commented Dec 17, 2024

We've encountered deadlocks which prevent the process from exiting successfully when using the V2 stack profiler.
Python version is 3.11.11, ddtrace version is 2.17.3, architecture is linux/amd64.

The child process is stuck on Datadog::Sampler::register_thread:

(gdb) bt
#0  futex_wait (private=0, expected=2, futex_word=0x7f78af07e760) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x7f78af07e760, private=0) at ./nptl/lowlevellock.c:49
#2  0x00007f78b0f2b452 in lll_mutex_lock_optimized (mutex=0x7f78af07e760) at ./nptl/pthread_mutex_lock.c:48
#3  ___pthread_mutex_lock (mutex=0x7f78af07e760) at ./nptl/pthread_mutex_lock.c:93
#4  0x00007f78af06ea8a in ?? ()
   from /app/venv/lib/python3.11/site-packages/ddtrace/internal/datadog/profiling/stack_v2/_stack_v2.cpython-311-x86_64-linux-gnu.so
#5  0x00007f78af074269 in Datadog::Sampler::register_thread(unsigned long, unsigned long, char const*) ()
   from /app/venv/lib/python3.11/site-packages/ddtrace/internal/datadog/profiling/stack_v2/_stack_v2.cpython-311-x86_64-linux-gnu.so

And then the parent process waits in join() forever.

Here's a simple reproduction script:

import time
from concurrent.futures import ProcessPoolExecutor


def worker():
    time.sleep(0.1)


def main():
    with ProcessPoolExecutor(max_workers=16) as executor:
        for _ in range(128):
            executor.submit(worker)


if __name__ == "__main__":
    main()

Run with:

DD_PROFILING_ENABLED=1 DD_PROFILING_STACK_V2_ENABLED=1 ddtrace-run python3 test.py

Just run it a couple of times (it's non-deterministic) and you'll see hangs every now and then.
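
For convenience, a small driver along these lines can run the repro repeatedly and flag any run that fails to exit. This is not part of the original report; it assumes the script above is saved as test.py, that ddtrace is installed, and the iteration count and 60-second timeout are arbitrary choices:

import os
import subprocess

env = dict(os.environ, DD_PROFILING_ENABLED="1", DD_PROFILING_STACK_V2_ENABLED="1")

for i in range(20):
    try:
        # A run that finishes quickly is healthy; one that hits the timeout
        # is almost certainly stuck in the deadlock described above.
        subprocess.run(["ddtrace-run", "python3", "test.py"],
                       env=env, timeout=60, check=True)
        print(f"run {i}: ok")
    except subprocess.TimeoutExpired:
        print(f"run {i}: hung (likely the deadlock)")
    except subprocess.CalledProcessError as exc:
        print(f"run {i}: exited with status {exc.returncode}")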

I've also noticed that the stuck process is always the first worker process created, which might be a clue.

Thanks!

@sanchda (Contributor) commented Dec 17, 2024

Whoa! Very meaningful issue, @oranav. Deepest apologies for the inconvenience, but thank you for the report. Will work on this ASAP.

@sanchda sanchda self-assigned this Dec 17, 2024
@sanchda sanchda added the Profiling (Continuous Profiling) label Dec 17, 2024
@sanchda (Contributor) commented Dec 18, 2024

Took a careful look at this, and thanks to some insight from @taegyunkim I think #11768 fixes it.

@sanchda (Contributor) commented Dec 18, 2024

Will update this thread with

  1. Timeline for release. Note that the release will be backported to 2.16 onward.
  2. A potential strategy for pulling a pre-release artifact for deployment in affected environments (not sure how durable this will be, but if your application is deadlocking, your options are to disable the feature, wait for a release, or do something else, right?). An example command for the disable-the-feature route is shown below.

@oranav; thanks again for the comprehensive overview, the thoughtful discussion, and the effective reproduction. This really improves our ability to provide a speedy fix. 🙇
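
For anyone who needs that stopgap right away, one option (assuming the DD_PROFILING_STACK_V2_ENABLED flag from the repro accepts 0 to fall back to the original stack sampler) is to disable only the v2 stack sampler while leaving the rest of the profiler enabled:

DD_PROFILING_ENABLED=1 DD_PROFILING_STACK_V2_ENABLED=0 ddtrace-run python3 test.py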

@oranav (Contributor, Author) commented Dec 18, 2024

Thanks @sanchda! Indeed, one of my suspicions was that the process forks while the mutex is held, but I thought forking must happen while the GIL is held, and that Echion's mutexes are all taken under the GIL - but apparently I missed something :)

sanchda added a commit that referenced this issue Dec 19, 2024
I'm not sure why it took so long to surface this defect, but it turns out that stack v2 can deadlock applications because not all mutexes are reset.

The repro in #11762 appears to be pretty durable. I need to investigate
it a bit more in order to distill it down into a native stress test we
can use moving forward. In practice, this patch suppresses the noted
behavior in the repro.

Co-authored-by: Taegyun Kim <[email protected]>

@sanchda (Contributor) commented Dec 19, 2024

#11768 has been merged, so this should be fixed in an upcoming release.

"but I thought forking must happen while the GIL is held, and that Echion's mutexes are all taken under the GIL"
You're correct that os.fork() will happen under the GIL, but as you suspect there are a few wrinkles.

  • Our use of Echion places it on a native thread, where it is not beholden to the GIL in any way. In particular, that means the state of each mutex is indeterminate at the time of fork(), and we have to do... uh... stuff to make the world right again (a sketch of the idea follows after this list).

  • That pattern must extend to all of the other native mutexes in the project.
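
To make the wrinkle concrete, here is a minimal pure-Python sketch. It is not ddtrace's code (the real fix in #11768 lives in the native stack_v2 extension), and the names sampler/reset_lock_in_child are made up for illustration; it only shows the same idea: a lock held by a background thread at fork() time stays locked forever in the child, because the owning thread does not exist there, so an after-fork handler has to reinitialize it.

# Illustration only (Linux/POSIX, as in the report); NOT ddtrace's actual fix.
import os
import threading
import time

lock = threading.Lock()

def sampler():
    # Stands in for a native sampler thread that grabs the lock on its own
    # schedule, regardless of what the forking thread (and the GIL) is doing.
    while True:
        with lock:
            time.sleep(0.001)

def reset_lock_in_child():
    # After-fork handler: replace the (possibly still held) lock with a
    # fresh one so the child can make progress.
    global lock
    lock = threading.Lock()

os.register_at_fork(after_in_child=reset_lock_in_child)

threading.Thread(target=sampler, daemon=True).start()
time.sleep(0.01)

pid = os.fork()
if pid == 0:
    with lock:  # without the handler, this can block forever
        pass
    print("child: lock acquired, exiting cleanly")
    os._exit(0)
os.waitpid(pid, 0)
print("parent: child exited")

Commenting out the os.register_at_fork(...) line reproduces the hang in miniature: the child blocks on a lock whose owner no longer exists, and the parent waits on it forever, just like the join() in the original report.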

@sanchda (Contributor) commented Dec 19, 2024

I'm going to keep this open until an end-user reports that the new release fixes the problem. @oranav; if you like, I can direct you to a workflow for getting access to a pre-release build, but otherwise I can update this thread when an official release has been made available.

sanchda added a commit that referenced this issue Dec 19, 2024
(cherry picked from commit d855c4a)

@sanchda (Contributor) commented Dec 20, 2024

FYI, this will be released (later today, I hope) in 2.18.1. I'm also attempting to backport it to the 2.17 and 2.16 lines (🤞). It'll be part of mainline starting in the 2.19.0 release.

@sanchda (Contributor) commented Dec 20, 2024

Confirming that 2.18.1 shipped. Would love to hear some folks weigh in on whether or not it solved this problem for them.
