Race condition leading to hanging tests using coverage >=7.5.5 #1892

Open
digitalresistor opened this issue Nov 15, 2024 · 4 comments
Labels
bug (Something isn't working), question (Further information is requested)

Comments

@digitalresistor

Describe the bug

On the waitress project we use coverage along with pytest-cov to compute coverage on all runs. Most recently we received a new contribution that triggered CI across the test matrix, and some of those runs hung in tests/test_functional.py. These tests spin up a server (with threads) using multiprocessing.

The developer making the changes caught the issue and provided the stack trace they got when they hit Ctrl+C on the hung test suite:

Pylons/waitress#446 (comment)

Copied in its entirety here:

platform linux -- Python 3.12.7, pytest-8.3.3, pluggy-1.5.0
rootdir: .../projects/waitress
configfile: setup.cfg
testpaths: tests
plugins: cov-5.0.0
collected 796 items

tests/test_adjustments.py ................................................. [  6%]
tests/test_buffers.py .................................................... [ 12%]
tests/test_channel.py ......................................................................................................................... [ 27%]
tests/test_functional.py ...................................................................................^CTraceback (most recent call last):
  File ".../projects/waitress/src/waitress/server.py", line 325, in run
    self.asyncore.loop(
  File ".../projects/waitress/src/waitress/wasyncore.py", line 245, in loop
    poll_fun(timeout, map)
  File ".../projects/waitress/src/waitress/wasyncore.py", line 183, in poll
    read(obj)
  File ".../projects/waitress/src/waitress/wasyncore.py", line 104, in read
    obj.handle_read_event()
  File ".../projects/waitress/src/waitress/wasyncore.py", line 466, in handle_read_event
    self.handle_read()
  File ".../projects/waitress/src/waitress/channel.py", line 156, in handle_read
    data = self.recv(self.adj.recv_bytes)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../projects/waitress/src/waitress/wasyncore.py", line 409, in recv
    def recv(self, buffer_size):
    
  File ".../projects/waitress/.venv/lib/python3.12/site-packages/coverage/collector.py", line 252, in lock_data
    self.data_lock.acquire()
  File ".../projects/waitress/tests/test_functional.py", line 43, in sigterm
    sys.exit(0)
SystemExit: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File ".../projects/waitress/tests/test_functional.py", line 33, in start_server
    svr(app, queue, **kwargs).run()
  File ".../projects/waitress/src/waitress/server.py", line 331, in run
    self.task_dispatcher.shutdown()
  File ".../projects/waitress/src/waitress/task.py", line 118, in shutdown
    def shutdown(self, cancel_pending=True, timeout=5):
    
  File ".../projects/waitress/.venv/lib/python3.12/site-packages/coverage/collector.py", line 252, in lock_data
    self.data_lock.acquire()
KeyboardInterrupt

This is how it looks in CI, until it times out:

[Screenshot 2024-11-14 at 20 20 54: CI run, image not reproduced here]

I myself develop on macOS (M1 MacBook Pro) and have not been able to reproduce the issue at all locally. Turning coverage off in CI runs made the issue go away, so I did some testing:

  • I started by downgrading to 7.5.4: it hung.
  • I then downgraded to 7.4.4: it did not hang.
  • I then slowly worked my way back up to the newest version that works, which is 7.5.3.

Pylons/waitress#454

It shows the various MRs and contains the action runs so you can view them.

To Reproduce
How can we reproduce the problem? Please be specific. Don't link to a failing CI job. Answer the questions below:

  1. What version of Python are you using?
    • Python 3.9
    • Python 3.10
    • Python 3.11
    • Python 3.12
    • Python 3.13
  2. What version of coverage.py shows the problem? The output of coverage debug sys is helpful.
    • 7.6.5
    • 7.5.5
  3. What versions of what packages do you have installed? The output of pip freeze is helpful.
    • coverage==7.6.5
    • iniconfig==2.0.0
    • packaging==24.2
    • pip==24.3.1
    • pluggy==1.5.0
    • pytest==8.3.3
    • pytest-cov==6.0.0
  4. What code shows the problem? Give us a specific commit of a specific repo that we can check out. If you've already worked around the problem, please provide a commit before that fix.
  5. What commands should we run to reproduce the problem? Be specific. Include everything, even git clone, pip install, and so on. Explain like we're five!

This is a race condition, so it may or may not happen on any given run. I have been unable to reproduce it outside of CI/CD. It seems to happen fairly often in CI, and rerunning the jobs will usually allow them to succeed.
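Since the hang is intermittent, one way to chase it outside CI is to rerun the suite under a wall-clock timeout until an attempt stalls. The following is only a minimal sketch and is not taken from the waitress or coverage repositories: `run_until_hang` and its parameters are hypothetical names, and the pytest command shown in the usage comment is an assumption about how the suite is invoked.

```python
import subprocess
import sys


def run_until_hang(cmd, attempts=20, timeout=300):
    """Run `cmd` repeatedly; return the 1-based attempt that timed out, or None.

    `timeout` is in seconds. A timeout is treated as "this attempt hung".
    """
    for attempt in range(1, attempts + 1):
        try:
            # check=False: ordinary test failures are not what we are hunting.
            subprocess.run(cmd, timeout=timeout, check=False)
        except subprocess.TimeoutExpired:
            return attempt
    return None


# Hypothetical usage (the exact pytest arguments are an assumption):
# run_until_hang([sys.executable, "-m", "pytest", "--cov", "tests/test_functional.py"])
```

Note that `subprocess.run` only kills the direct child on timeout, so server processes started through multiprocessing may have to be cleaned up by hand after a hung attempt.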

Expected behavior

No deadlock/hang while running the test suite with newer versions of coverage.

Additional context

This is a race condition. I'm sorry, I haven't been able to reproduce it at all locally, so I can't provide any more data or debug information.

@digitalresistor added the bug label Nov 15, 2024
@digitalresistor
Author

I think this may be related to a workaround that we had in the tests to make sure coverage would write output:

import os
import signal
import sys


def try_register_coverage():  # pragma: no cover
    # Hack around multiprocessing exiting early and not triggering coverage's
    # atexit handler by always registering a signal handler

    if "COVERAGE_PROCESS_START" in os.environ:

        def sigterm(*args):
            # Raises SystemExit at whatever point the signal interrupted.
            sys.exit(0)

        signal.signal(signal.SIGTERM, sigterm)

This was originally added for coverage version 5.x.

Removing this fixes the hang. My guess is that the order in which the signal handlers run is not deterministic, hence the inability to easily reproduce this issue.
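One possible reading of the traceback above is that SystemExit was raised while the process was inside coverage's lock_data, and that the shutdown code which ran afterwards then blocked on the same lock. The sketch below only illustrates that general failure mode with a plain `threading.Lock`. It is not coverage's code, and whether `data_lock` is actually left held this way is an assumption. `fake_sigterm_handler`, `locked_without_cleanup`, `try_again` and `demo` are hypothetical names, and the signal is simulated with an ordinary call so the example is deterministic.

```python
import threading


def fake_sigterm_handler():
    # Stands in for the sigterm() handler above, which calls sys.exit(0).
    raise SystemExit(0)


def locked_without_cleanup(lock, interrupt):
    """Acquire the lock, get interrupted, and never reach release()."""
    lock.acquire()
    interrupt()     # simulated signal arriving while the lock is held
    lock.release()  # not reached when interrupt() raises


def try_again(lock, timeout=0.1):
    """Return True if the lock can be taken again within `timeout` seconds."""
    got_it = lock.acquire(timeout=timeout)
    if got_it:
        lock.release()
    return got_it


def demo():
    lock = threading.Lock()  # non-reentrant
    try:
        locked_without_cleanup(lock, fake_sigterm_handler)
    except SystemExit:
        pass  # shutdown code would run here, as in the second traceback
    # Without the timeout, this second acquire would block forever.
    return try_again(lock)
```

Here `demo()` returns `False`: the lock is still held by the interrupted call, so a later blocking `acquire()` on it would never return.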

@nedbat
Owner

nedbat commented Nov 15, 2024

So should we close this as not a bug?

@nedbat added the question label Nov 15, 2024
@digitalresistor
Author

Maybe?

Removing the handler solves my issue, so I would be fine with this being closed. But if someone does register a signal handler, wouldn't this race condition still potentially exist? Coverage's attempt to take a lock could still hang the process when it receives a SIGTERM.
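For code that does need a SIGTERM handler, one common pattern is for the handler to record the request and let the main loop exit at a safe point, instead of raising from wherever the signal happened to land. This is a minimal sketch of that pattern, not a proposed change to coverage or waitress: `stop_requested`, `request_stop`, `install_handler` and `run_loop` are hypothetical names, and whether this leaves coverage able to write its data in the multiprocessing case is not established here.

```python
import signal
import threading

stop_requested = threading.Event()


def request_stop(signum, frame):
    # Record the request only; nothing is raised at the interruption point.
    stop_requested.set()


def install_handler():
    # signal.signal() must be called from the main thread.
    signal.signal(signal.SIGTERM, request_stop)


def run_loop(poll_once, max_iterations=100):
    """Call poll_once() until a stop is requested; return the number of polls."""
    polls = 0
    while not stop_requested.is_set() and polls < max_iterations:
        poll_once()
        polls += 1
    return polls
```

The trade-off is latency: the process only stops once the current `poll_once()` call returns, so the poll needs a bounded timeout.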

@nedbat
Owner

nedbat commented Nov 15, 2024

Do you have a solution?
