-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Strip down gil.h to the simple version only. #4218
Conversation
All CI failures appear to be of the kind below. Speculation: The failing test relies on
|
@jbms Question regarding e70b38b: + // This used to be triggered by py::gil_scoped_release().
+ // TODO: Explain why/how this is relevant to this unit test.
+ py::detail::get_internals();
REQUIRE(has_pybind11_internals_static()); Locally I tried to remove those 4 lines and the test passes, but does that make sense? I also tried to move the
without adding the explicit |
It looks like that test isn't testing anything that uses pybind11 internals anymore, so it seems like we could just remove |
…ed_acquire_local there).
…QUIRE(has_pybind11_internals_static());` along with it, based on pybind#4218 (comment)
…version), to avoid breaking ABI compatibility. (pybind#4216 (comment))
Cool, thanks! Done: ad126f5 While I was at it, I also took care of this one: be83638 But I see the CI is unhappy, need to look & fix ... |
Given that the default internals version is still 4, I think the |
Just to mention: I made a silly mistake in be83638, it should have been I tried this (locally only):
But even with the "unrelated features" comment in that place, the code could easily be misunderstood by someone new glancing through. We don't have to skimp with the internals numbers, I think making it more obvious that there were two features is more important. |
@wjakob Could you please let us know if we still need the code introduced with 39e97e6#diff-4249f25293fdbc32ba7cb0a011c27ff420af64a7f95931553f953dc41f38915e? |
I looked in nanobind: https://github.com/wjakob/nanobind/blob/526e2c798547185e37f66f77e3876311157eca95/include/nanobind/nb_misc.h It is almost exactly the simple version here (although it is missing #4183). That strongly suggests @wjakob has decided that the complicated |
@henryiii @Skylion007 could you please comment on this PR? My preference is to have this as a bug fix in 2.10.1. Note that this was globally tested, which has been highly conclusive in the past. I'm not aware of any arguments for keeping the code that is known to cause issues in practice, and was found to have problems by inspection (see #4216). |
Have you checked performance? My first thought when seeing large amounts of code with a very simple fallback is that it was done that way for performance1. Some codes might acquire and release the gil quite a bit, and they'll care a lot about performance. I'm also a little worried about dropping in a big change right before releasing a patch release, I'd rather have it sit in master a bit. That why I suggested making it opt-in for the patch release. Hmm, according to the comments, it's because the simple code doesn't work in some cases. What about the three points listed in the code that you've deleted? Specifically, have you compiled NanoGUI with this change? It is supposed to demonstrate the need for the more complex custom code. Maybe our improvements over the years have made this unnecessary, but I'd like a little bit of confirmation. Footnotes
|
If you can prove https://github.com/mitsuba-renderer/nanogui works with this change, I'm in. |
I tested this, and nanogui can't even compile with this change. |
Nanogui could copy the old code to its own repo if it wants to use it, or build using an older version of pybind11. But since the old code is buggy and doesn't appear to be fixable without adding new apis to cpython, it would probably be better for nanogui to use a different approach entirely. I don't think we should keep buggy behavior by default just to support a single user. |
That's fine, but we don't knowingly break users in patch releases. This code has been linked to from the pybind11 source for years, so I'd be quite surprised if this was truly the only user. This can't be stripped out till 2.11. I'm okay to have the new behavior opt-in in 2.10.1, or not, but the default can't change till 2.11. If there are several more users, then I'd say 3.0 is needed for breaking change, I'm only okay with it in a minor release because I'm still assuming the number of users is small. I'll probably at least make a PR to nanogui to keep from breaking them in 2.11, either by limiting pybind11 or by coping the removed code there. |
Actually, I don't think NanoGUI can copy it, because they can't add the required internals pointers. This will kill NanoGUI eventually if they can't adapt to the new code, either by limiting pybind11 version to something that will eventually no longer work, or by requiring an old internals version that we eventually kill off (which I'd like to do, setting the internals version is ugly, it's better if it's linked to pybind11 version). NanoGUI is just the obvious user, there could easily be more (and it has a combined 5k stars and ~800 forks). (It's also a very nice looking project, there aren't many cross platform OSS graphic libraries that can build from scratch. Though it's also lacking working examples and tests...) |
Thanks @henryiii for trying out the nanogui build. I left a message there: |
You should also open an issue in |
Alright, two is a crowd. Thanks for digging this up, too. The first thing we need is a test in pybind11, if we keep that functionality. |
I expect it's really hard to to test threaded code like this, which is why it's missing tests. Also, I'm okay to have the simple version opt-in, then switch to opt-out eventually (Using the "advanced" (buggy) version breaks pypy compatibility anyway, so making the complex version require setup seems quite reasonable even if we can't remove it). Maybe the projects can work out how to avoid this, and then we can drop it. I agree that 95% of users likely will like the simpler version better. |
Let's first hear from them, what they actually need. Maybe we can change-and-reduce the non-simple version to something that works for 100% of pybind11 users?
I agree, but the lack of discipline has a cost, too. My gut estimate: the spread-out cost for people dealing with the buggy behavior, in accumulation, exceeds the cost for getting this tested and right. The problem of course is to collectively get organized enough to fix that. |
Pragmatically, for 2.10.1, my current thinking:
|
Even if the two users we've identified and all the others we haven't found yet are able to quickly adapt, I'm not comfortable shipping a known breaking change for an old feature in a patch release. So we either don't put it in 2.10.1 at all, or we have an option that allows the simpler version to be activated. Since PyTorch would like the simpler version, I'm inclined for the latter. Let's make the simple version optional and non-default in 2.10.1, and default in 2.11.0 but with an opt-out. We can put a warning in the changelog / upgrade guide. We can make PRs fixing those two projects to be ready for 2.11. I think 2.11's not terribly far way. Maybe we can remove this in 2.12/3.0 if the transition is smooth in 2.11. I don't think "repeats of the reported bugs" is that big of an issue - many pybind11 users have happily been using pybind11, and there are lots of other bugs in issues that we can work on if we really want to work out all possible bugs. Let's not cause regressions and grief for ourselves and users, and take this slow. |
I don't know, I've never knowingly released something untested & broken like this, I'm very uncomfortable having my name associated with such a trap. But let's see if we can get more data, maybe it's not as severe as I think?
How much time did you and your teams spend dealing with issues related to the non-simple |
Pybind11 has over 400 issues open. Our test coverage is rather poor, too. We have lots of edge cases and ways to do broken things. Our release and maintenance are on a "best effort" basis. In this case, we do support dissociation of the gil (and have for many years) and we don't support reentry (except on PyPy, apparently). We can change what we support in a minor or major release, but not a patch release. We can add an opt-in flag in a patch release safely. Patch releases are releases that try to only provide fixes. If we make it an opt-in fix, it's still a fix. If we make it opt-out, it's a breaking change. We follow Python's deprecation period for breaking changes - we will release at least one or two releases with warnings before we make a breaking change. I'd like to get a patch release out before Monday. That doesn't provide much time. We have more time to decide for 2.11 if we want to change it from opt-in to opt-out, and potentially set a target version for removing the old form entirely. |
I suppose nanogui can't literally copy the code, but there is no reason it needs to use pybind11's internals to hold the thread local storage identifier --- it can create it its own thread local storage identifier that it stores wherever it wishes. |
Is there reason that the next release needs to be called 2.10.1? It could be called 2.11 instead. |
Because we need to make a release before Python 3.11 release on Monday, and we aren't ready to ship breaking changes. We are not making a breaking change then releasing a couple of days later. Master has already moved on past the 2.10 branch, so we can make it opt-out immediately and then make it opt-in in the backport, I don't care. But "fixing" a perceived bug in pybind11 that's existed for multiple years and forcing your fix on everyone else does not need an emergency minor release tomorrow. If it makes everyone feel better, I can just not backport #4216 at all and you can pretend "2.10.0" and "2.10.1" are the same thing. |
To be clear, the pybind11 gil handling is not untested & broken. It is tested, and works for the 100K+ pybind11 users. It simply does not support nesting, and this change would allow it to support nesting, which at least a few users (one identified) want, at the expense of being able to dissociate the handling from the current thread, which at least a couple of (identified) users need. |
Most of them are stale and none rises to the same level of concern, except the severe issues (#1138, #1333) that are fixed in the smart_holder branch.
That is not true. "Not complete" is fair to say, but generally we have >90% coverage.
Yeah, I already suggested abandoning 2.10 for another reason, suspected unintentional ABI break: #4125 (comment)
What is the minimum we'd need to do for 3.11? — I don't know off the top of my head and need to go look through the diffs. |
It is obviously untested in pybind11 (which is what I meant), otherwise the CI for this PR wouldn't pass. |
If we take this more slowly, we can copy the code needed for nanogui, provide that as an example in the upgrade guide for 3.11, and gently move to the new version in a few weeks - 2.11 is not years away or anything terrible. The release in the next couple of days could include this as an opt-in preview, which PyTorch (which I think already worked around this anyway?) could then opt-in to, along with anyone else that encounters this issue. There is no need to rush 2.11 just because we need to ship a few fixes for 2.10 before Python 3.11. |
I meant the gil handling is tested. The only thing that's not tested is the dissociation feature (which is hard to test in a test running a python process), or nesting (which is unsupported before this). But the gil handling code is very much tested, and used in many projects. |
A bit random, but maybe useful for later, this PR does NOT fix the TSAN error I've seen ever since I first ran the pybind11 tests with the clang sanitizers (~2 years ago). See also PR #2754. (Note that the assertion failures also appear regularly in flakes, (almost?) always Windows.)
|
Did you test this on anything that was supposed to be fixed by it? I copied the contents of #1276 on ubuntu 18.04 (since I was there for something else) and still get: $ git fetch origin pull/4218/head:MASTER
$ c++ I/usr/include/python3.6m -I/usr/include/python3.6m -I/pybind11/include -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -flto -fuse-linker-plugin -ffat-lto-objects -std=c++11 gil.cpp -Wl,-Bsymbolic-functions -Wl,-z,relro -L/usr/lib/python3.6/config-3.6m-x86_64-linux-gnu -lpython3.6m -lpthread -ldl -lutil -lm -o gil
$ ./gil
Calling C++ code
Segmentation fault The second example passes, even on the old version (I'm using On macOS, both before and after this patch, the examples pass. On Ubuntu 20.04 and 22.04, both pass as well. So this patch has no visible effect on #1276, which seems to be resolved or at least better on most newer compilers. # On henryiii:henryiii/feat/clihelper branch
docker run --rm -v$PWD:/pybind11 -w /pybind11 -it ubuntu:22.04
apt update && apt install -y g++ python3-dev
compile_command=$(python3 -m pybind11 --embed --file=gil.cpp)
compile_command2=$(python3 -m pybind11 --embed --file=gil2.cpp)
# git checkout testing version in another window
c++ $compile_command && ./gil
c++ $compile_command2 && ./gil2 |
To answer this question (sorry for delay): I didn't test with the #1276 code and only glanced at that bug and #1322 a little bit before. My starting point was #4216, the idea here was to try what happens if we only have the simple code: the CI passed, and Google global testing. #1276 is from 4 yr 8 mo ago. I don't see references to Python 3 in the comments, only Python 2.7. I didn't spot details about platforms and Python versions. With so much uncertainty after such a long time, is this still something useful to look at? I'm thinking it's best to leave that old issue behind (not reference it anymore) and only focus on #4216. Considering the uses of the non-simple version you found (comments here), this PR is too radical. I agree we need to keep the non-simple version around for a while. Our only difference seems to be when to flip the default, with me on the "as soon as possible" side, and you on the "let's wait" IIUC. For Google-internal use, I still want to flip the default as soon as possible. I don't want other Google engineers to find out the hard way (like the PyTorch people) that they need the simple version, when they probably never even became aware of the features of the non-simple version, but just wanted a convenient way to ensure the GIL is held in a section of code. I could flip the default only on the smart_holder branch, that would achieve my goal for Google's purposes. I'll convert this PR to draft mode. Rationale: Q: Is it better to just add a define as in #4216, or also fold in the |
Closing this PR now, I think it will be completely stale by the time we get to removing the non-simple code. The change to test_embed/test_interpreter.cpp needed to be transferred to #4216, so that the tests pass when |
Description
Prompted by discussion under PR #4216. Please see there for background.
This PR removes ~170 lines of code, and two
internals
variables (guarded by internals version).This passes Google-global testing (internal testing ID OCL:479233883:BASE:479302840:1665064423602:50af055b).
Suggested changelog entry:
An outdated version of `gil_scoped_acquire` was removed from pybind11/gil.h, resolving #1276 and pytorch/pytorch#83101.