Race condition leading to hanging tests using coverage >=7.5.5 #1892

digitalresistor · 2024-11-15T03:31:49Z

Describe the bug

On the waitress project we use coverage along with pytest-cov to compute coverage on all runs. Recently a new contribution triggered CI across the test matrix, and the runs included hangs in tests/test_functional.py. These tests spin up a server (with threads) using multiprocessing.

The developer who was adding new changes caught the issue and provided a stack trace when they hit Ctrl+C due to the test suite hanging:

Pylons/waitress#446 (comment)

Copied in its entirety here:

platform linux -- Python 3.12.7, pytest-8.3.3, pluggy-1.5.0
rootdir: .../projects/waitress
configfile: setup.cfg
testpaths: tests
plugins: cov-5.0.0
collected 796 items

tests/test_adjustments.py .................................................                                                                                                                                                                                                                                                                                                           [  6%]
tests/test_buffers.py ....................................................                                                                                                                                                                                                                                                                                                            [ 12%]
tests/test_channel.py .........................................................................................................................                                                                                                                                                                                                                                       [ 27%]
tests/test_functional.py ...................................................................................^CTraceback (most recent call last):
  File ".../projects/waitress/src/waitress/server.py", line 325, in run
    self.asyncore.loop(
  File ".../projects/waitress/src/waitress/wasyncore.py", line 245, in loop
    poll_fun(timeout, map)
  File ".../projects/waitress/src/waitress/wasyncore.py", line 183, in poll
    read(obj)
  File ".../projects/waitress/src/waitress/wasyncore.py", line 104, in read
    obj.handle_read_event()
  File ".../projects/waitress/src/waitress/wasyncore.py", line 466, in handle_read_event
    self.handle_read()
  File ".../projects/waitress/src/waitress/channel.py", line 156, in handle_read
    data = self.recv(self.adj.recv_bytes)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../projects/waitress/src/waitress/wasyncore.py", line 409, in recv
    def recv(self, buffer_size):
    
  File ".../projects/waitress/.venv/lib/python3.12/site-packages/coverage/collector.py", line 252, in lock_data
    self.data_lock.acquire()
  File ".../projects/waitress/tests/test_functional.py", line 43, in sigterm
    sys.exit(0)
SystemExit: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File ".../projects/waitress/tests/test_functional.py", line 33, in start_server
    svr(app, queue, **kwargs).run()
  File ".../projects/waitress/src/waitress/server.py", line 331, in run
    self.task_dispatcher.shutdown()
  File ".../projects/waitress/src/waitress/task.py", line 118, in shutdown
    def shutdown(self, cancel_pending=True, timeout=5):
    
  File ".../projects/waitress/.venv/lib/python3.12/site-packages/coverage/collector.py", line 252, in lock_data
    self.data_lock.acquire()
KeyboardInterrupt

This is how it looks in CI, until it times out:

I myself develop on macOS (M1 MacBook Pro) and have not been able to reproduce the issue at all locally. Turning coverage off in CI runs made the issue go away, so I did some testing:

I started by downgrading to 7.5.4: hung
Downgraded to 7.4.4: did not hang
Then slowly worked my way back up to the newest version that works, which is 7.5.3.

Pylons/waitress#454

It shows the various MRs and contains the action runs so you can view them.

To Reproduce
How can we reproduce the problem? Please be specific. Don't link to a failing CI job. Answer the questions below:

What version of Python are you using?
- Python 3.9
- Python 3.10
- Python 3.11
- Python 3.12
- Python 3.13
What version of coverage.py shows the problem? The output of coverage debug sys is helpful.
- 7.6.5
- 7.5.5
What versions of what packages do you have installed? The output of pip freeze is helpful.
- coverage==7.6.5
- iniconfig==2.0.0
- packaging==24.2
- pip==24.3.1
- pluggy==1.5.0
- pytest==8.3.3
- pytest-cov==6.0.0
What code shows the problem? Give us a specific commit of a specific repo that we can check out. If you've already worked around the problem, please provide a commit before that fix.
- Issue exists on main on https://github.com/Pylons/waitress
- Commit sha1: Pylons/waitress@23ac524
What commands should we run to reproduce the problem? Be specific. Include everything, even git clone, pip install, and so on. Explain like we're five!
- python3 -mvenv toxcmd
- ./toxcmd/bin/pip install -U tox
- git clone https://github.com/Pylons/waitress.git
- ./toxcmd/bin/tox -e py

This is a race condition, so it may or may not happen. I have been unable to reproduce it outside of CI/CD. It seems to happen fairly often there; rerunning jobs will usually allow them to succeed.

Expected behavior

No deadlock/hang while running the test suite with newer versions of coverage.

Additional context

This is a race condition. I'm sorry, I haven't been able to reproduce it at all locally, so I can't provide any more data or debug information.


digitalresistor · 2024-11-15T04:22:18Z

I think this may be related to a workaround that we had in the tests to make sure coverage would write output:

import os
import signal
import sys


def try_register_coverage():  # pragma: no cover
    # Hack around multiprocessing exiting early and not triggering coverage's
    # atexit handler by always registering a signal handler

    if "COVERAGE_PROCESS_START" in os.environ:

        def sigterm(*args):
            sys.exit(0)

        signal.signal(signal.SIGTERM, sigterm)

This was originally added for coverage version 5.x.

Removing this fixes the hang. My guess is that the signal handlers run in a nondeterministic order, hence the inability to easily reproduce this issue.
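A minimal sketch of how this kind of hang can arise, under assumptions: if a handler raises SystemExit while the interrupted code holds a non-reentrant lock, and the release is not protected by try/finally, the lock stays held and the next acquire on the same thread blocks. The sketch simulates the interruption with a direct call instead of a real signal. The names `sigterm_like`, `traced_step`, and `lock_is_stuck_after_interrupt` are hypothetical, and this is not coverage's actual locking code.

```python
import threading


def sigterm_like():
    # Stand-in for the test suite's SIGTERM handler, which calls sys.exit(0).
    raise SystemExit(0)


def traced_step(lock, interrupt):
    # Hypothetical critical section: acquire, do work, release.
    # There is no try/finally, so an exception raised mid-section leaks the lock.
    lock.acquire()
    if interrupt:
        sigterm_like()  # simulate a signal arriving while the lock is held
    lock.release()


def lock_is_stuck_after_interrupt():
    lock = threading.Lock()  # non-reentrant
    try:
        traced_step(lock, interrupt=True)
    except SystemExit:
        pass  # shutdown code would run next, as in the second traceback
    # A timeout is used so this demonstration returns instead of hanging.
    acquired = lock.acquire(timeout=0.1)
    if acquired:
        lock.release()
    return not acquired
```

Without the timeout, the second acquire would block indefinitely, which matches the shape of the traceback where the process sits in `self.data_lock.acquire()` until Ctrl+C.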

nedbat · 2024-11-15T13:51:50Z

So should we close this as not a bug?

digitalresistor · 2024-11-15T20:22:53Z

Maybe?

While it solves my issue, and thus I would be fine with it being closed, if someone does register a signal handler, wouldn't this race condition still potentially exist, causing coverage's attempt to take a lock to hang the process when it receives a SIGTERM?
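One pattern that may sidestep the problem on the application side, sketched under the assumption that the hang comes from raising inside the handler: have the handler only record the request, and let the main loop exit at a point of its own choosing. `StopFlag` and `run_until_stopped` are hypothetical names; this is not a change to coverage, and a real server loop would need its own place to check the flag.

```python
import signal
import time


class StopFlag:
    def __init__(self):
        self.requested = False

    def handler(self, signum, frame):
        # Only record the request: no locks are taken and nothing is raised,
        # so whatever code was interrupted resumes and finishes normally.
        self.requested = True


def run_until_stopped(flag, step, poll_interval=0.01):
    """Call step() repeatedly until the flag has been set."""
    iterations = 0
    while not flag.requested:
        step()
        iterations += 1
        time.sleep(poll_interval)
    return iterations


# Registration would look like this, and must happen in the main thread:
#     flag = StopFlag()
#     signal.signal(signal.SIGTERM, flag.handler)
```

A plain attribute is used instead of `threading.Event` because `Event.set()` takes an internal lock, which would reintroduce the same hazard if the handler interrupted code holding that lock.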

nedbat · 2024-11-15T23:18:07Z

Do you have a solution?

digitalresistor added the bug label (Something isn't working) · Nov 15, 2024

nedbat added the question label (Further information is requested) · Nov 15, 2024
