Global deadlock detector #1085

njsmith · 2019-06-05T03:25:09Z

We have an issue for fancier deadlock detection, and API support to make it more useful (#182). This is about a simpler issue: detecting when the entire program has deadlocked, i.e. no tasks are runnable or will ever be runnable again. This is not nearly as fancy, but it would catch lots of real-world deadlock cases (e.g. in tests), and is potentially wayyy simpler. In particular, I believe a Trio program has deadlocked if:

There are no runnable tasks
There are no registered timeouts
There are no tasks waiting on the IOManager
No-one is blocked in wait_all_tasks_blocked

(Did I miss anything?)

However, there is one practical problem: the EntryQueue task is always blocked in the IOManager, waiting for someone to call run_sync_soon.

Practical example of why this is important: from the Trio scheduler's point of view, run_sync_in_worker_thread puts a task to sleep, and then later a call to reschedule(...) magically appears through run_sync_soon. So... it's entirely normal to be in a state where the whole program looks deadlocked except for the possibility of getting a run_sync_soon, and the program actually isn't deadlocked. But, of course, 99% of the time, there is absolutely and definitely no run_sync_soon call coming. There's just no way for Trio to know that.

So I guess to make this viable, we would need some way to recognize the 99% of cases where there is no chance of a run_sync_soon. I think that means, we need to refactor TrioToken so that it uses an acquire/release pattern: you acquire the token only if you plan to call run_sync_soon, and then when you're done with it you explicitly close it.

This will break the other usage of TrioToken, which is that you can compare them with is to check if two calls to trio.run are in fact the same. Maybe this is not even that useful? If it is though then we should split it off into a separate class, so that the only reason to acquire the run_sync_soon-object is because you're going to call run_sync_soon.

Given that, I think we could implement this by extending the code at the top of the event loop like:

 if runner.runq:
     timeout = 0
 elif runner.deadlines:
     deadline, _ = runner.deadlines.keys()[0]
     timeout = runner.clock.deadline_to_sleep_time(deadline)
 else:
-    timeout = _MAX_TIMEOUT
+    if not runner.io_manager.has_waits() and not runner.tokens_outstanding and not runner.waiting_for_idle:
+        # Deadlock detected! Dump a stack tree and crash, maybe...?
+    else:
+        timeout = _MAX_TIMEOUT

This is probably super-cheap too, because we only do the extra checks when there are no runnable tasks or deadlines. No runnable tasks means we're either about to go to sleep for a while, so taking some extra time here is "free", or else that we're about to detect I/O, but if there's outstanding I/O then you should probably have a deadline set...

The text was updated successfully, but these errors were encountered:

njsmith · 2019-06-05T03:58:48Z

Minor tricky bit: our SIGINT handler needs to always hold a run_sync_soon handle. But I think we're ok: technically it is sort of true that no program is really deadlocked because the user could always hit control-C, but in practice if the only way to wake up your program is to hit control-C, then everyone considers that deadlocked. So we want to ignore this handle. And that's easy, because it's buried inside trio/_core anyway, so it can cheat and use a special handle that gets left out of the accounting.

njsmith · 2019-06-07T04:15:30Z

#456 raises some important questions about how we define "blocked" for purposes of introspective stuff like wait_all_tasks_blocked, the autojump clock, and a global deadlock detector; see in particular #456 (comment)

njsmith · 2019-06-11T01:24:46Z

I have a branch to experiment with splitting run_sync_soon off into a dedicated object that you open and close. The two big issues I've found so far:

BlockingTrioPortal assumes that TrioTokens are freely available, can be passed around, don't need closing, etc. So it's going to need to be reworked. We were already considering reworking it (Better usability for threads #810), so possibly that's the first thing to do?
I'm currently calling the thing TrioEntryHandle (returned by trio.hazmat.open_trio_entry_handle()). This is a super clunky name. Surely there's something better?

njsmith · 2019-09-11T04:01:01Z

Some more interesting issues that occurred to me, around the interaction between deadlock detection and run_sync_soon:

pytest-trio should eventually get the feature of setting a timeout on your tests, and cancelling the test if the timeout is exceeded (Add test timeout support pytest-trio#53), in order to avoid hanging from livelocks. This needs to use the system clock, not the trio clock, because you want the timeout to still work even when combined with the autojump clock. The simplest way to do this is to use a background thread. This background thread needs the capability to call back into Trio, to cancel things after the timeout expires. BUT, this should not prevent the deadlock detector from working! So pytest-trio needs a version of run_sync_soon that doesn't block the deadlock detector.
It's not ideal, but sometimes you need to use a __del__ method to trigger something inside Trio. This requires run_sync_soon. Again though, the existence of an object like this should not count as preventing deadlock – esp. since in most cases, if your program is deadlocked, the GC will probably never run again, so in fact the __del__ method can't be called. (And even if it could, then it would be super weird design to have the __del__ method be the thing that enables actual forward progress! Most likely it would just create a task that ran for a few moments to close stuff, and then exit and we'd be back in the deadlock.) So again, we probably need a version of run_sync_soon that doesn't block the deadlock detector.
- An example of where we'll need this inside Trio itself is for async generator cleanup, where Trio will register a cleanup hook called from __del__ that basically works like described above (see Implement a strategy for handling (async) generator cleanup #265)
OTOH, trio.to_thread.run_sync is like the canonical example of a case where we do want run_sync_soon to block the deadlock detector. But, even here...... if the thread uses trio.from_thread to re-enter Trio, then that means it's actually temporarily re-entering the zone where the deadlock detector can see what's happening. So ideally we should like... temporarily re-enable the deadlock detector, and then block it again when the trio.from_thread call returns. Subtle!

A simple solution would be to have a flag you pass when creating a TrioEntryTicket or whatever we call it, that controls whether this particular ticket blocks the deadlock detector or not. That would solve the first two cases. The last case is a bit more subtle – maybe it means we need a way to toggle an entry ticket's deadlock-detector-blocking-ness property (DDBNP) on and off?

Alternatively, I think we could use a trick: when you call trio.from_thread, it could close the ticket after re-entering Trio, and then when it returns to the original thread, it could create a new one and pass it back.

This sounds a bit elaborate, but there's actually some theoretical benefit too: I don't think you can have a correct system that relies on toggling the DDBNP from False -> True from outside Trio. There's an inherent race condition: if you're relying on setting it to True to block the deadlock detector from firing, then how do you know that the deadlock detector won't fire just before you set it to True? But if people have to go into Trio to get a deadlock-detector-blocking-ticket, then that automatically gives you a synchronization point.

Of course, if someone has a ticket with DDBNP = False, they can still use that to enter Trio and get a ticket with DDBNP = True, and that has exactly the same issues as toggling a single ticket in-place.

We don't really need to solve this right now – first we need to get the new entry ticket API in place, and then add deadlock detection, and then we can worry about whether toggling this is important.

njsmith · 2019-09-28T10:56:10Z

OK better idea!

Let's add inhibit_deadlock_detector=True/False as an argument to wait_task_rescheduled (default False). When you set it to True, it means that this task might wake up from stuff that Trio can't see, so the deadlock detector shouldn't fire.

(Or maybe call it could_deadlock=False/True?)

Then:

trio.to_thread can use a regular token that doesn't inhibit the deadlock detector, and instead pass inhibit_deadlock_detector=True when the task is blocked waiting for the thread; this automatically restricts the inhibition to the right moment
open_signal_receiver still probably wants an option to let you decide whether signal reception counts as something that can break a deadlock. But if it can, you still properly only want to inhibit the deadlock detector when a task is actually blocked waiting for a signal – not the whole time that the signal receiver exists!
simple __del__ calls automatically don't inhibit the detector
The IOManager code is simpler – they just pass inhibit_deadlock_detector=True when they block in wait_task_rescheduled, versus my original design where each IOManager had to reimplement the tracking of whether any tasks were blocked on IO.

So maybe we don't even need to replace TrioToken? We still probably want some way to manually inhibit the deadlock detector – a TrioToken can be used externally to spawn a whole new task, not just wake up an existing one. But this seems like a pretty rare event. So maybe we should leave TrioToken alone, and just document that if you're using it, you should think about whether to use with trio.hazmat.inhibit_deadlock_detector(): ... or similar.

smurfix · 2019-09-28T11:56:34Z

inhibit_deadlock_detector=True

or maybe external_event?

Anyway, the concept seems sound enough. However, I'd like to have the opposite option, i.e. the ability to mark a task as non-deadlock-breaking even if it is (or will shortly be) runnable.

Use case: a protocol's keepalive task which runs periodically and sends off an "I am alive" packet. As I understand it, without this option the existence of this task would be sufficient to inhibit deadlock detection.

njsmith · 2019-10-05T09:08:21Z

That's an interesting idea. I haven't thought too much about cases like a keepalive task. For now I've mostly focused on figuring out how to make sure this never gives false positives, since a false positive is "we just blew up your working program" while a false negative is "your broken program could potentially have gotten better diagnostics of how it was broken, but didn't". And I think the biggest value will be for small programs, like simple scripts and for unit tests; I guess large programs will generally have something somewhere that keeps running, even if some subsystem is deadlocked.

But... a small-but-annoying design problem I've been thinking about is, how should we prevent the wakeup-pipe task from blocking the deadlock detector. It's definitely solvable, since this is a special internal task with privileged access to Trio's internals. But right now it just blocks in wait_readable, and wait_readable is exposed to users and should normally inhibit the deadlock detector. We could add a secret internal version of wait_readable that doesn't inhibit the deadlock detector, but it's a bit inelegant to do that multiple times for the different IO managers and then use some hard-coded hack to let the wakeup-pipe task access it. And your suggestion would in fact be a way to solve this.

Another example where this would be useful would be python-trio/hip#125, where if you're using an HTTP library that silently uses background tasks to listen for pings on HTTP/2 and HTTP/3 connections, then having ever used HTTP/2 or HTTP/3 in your program could produce a situation where the deadlock detector never fires. (Depending on how patient the server is.) Not very nice!

I guess the implementation could be pretty trivial too... wait_task_rescheduled already has to mess around with the Task object. It would be trivial for it to do something like:

async def wait_task_rescheduled(..., inhibit_deadlock_detector=...):
    task = current_task()
    ...
    inhibit_deadlock_detector = (inhibit_deadlock_detector and task.can_inhibit_deadlock_detector)
    ...

njsmith · 2020-06-09T05:17:16Z

After #1588, I think actual deadlock detection could be as simple as:

# in run loop
idle_primed = False
if runner.waiting_for_idle:
    ...
elif runner.clock_autojump_threshold < timeout:
   ...
elif runner.external_wakeup_count == 0 and not runner.guest_mode:
    timeout = 0
    idle_primed = True
    idle_prime_type = IdlePrimedTypes.DEADLOCK

...

if idle_primed and idle:
    if idle_primed_type is IdlePrimedTypes.DEADLOCK:
        # ... crash with deadlock message ...

And runner.external_wakeup_count is maintained by a tool like trio.lowlevel.not_deadlocked:

@contextmanager
def not_deadlocked():
    task = trio.current_task()
    if not task.ignored_by_deadlock_detector:
        GLOBAL_RUN_STATE.runner.external_wakeup_count += 1
        try:
            yield
        finally:
            GLOBAL_RUN_STATE.runner.external_wakeup_count -= 1
    else:
        yield

IO primitives do with not_deadlocked(): await wait_task_rescheduled(...).

(Maybe it's also convenient allow a not_deadlocked=... kwarg to wait_task_rescheduled, as a shorthand? Or whatever name we end up using.)

We would also need to update the autojump clock so that if the autojump is triggered but there is no next deadline, then it runs a version of the deadlock-detected code. Though probably with a custom message.

A minor open question: what to do with open_signal_receiver. I feel like we want a kwarg to control whether it counts as a external wakeup source (which will need a name), and we also have to pick a default.

This is all looking pretty straightforward. I think the main remaining blockers are getting code to dump the task tree when a deadlock happens (#927), and some way to kill everything with a custom exception (maybe the KeyboardInterrupt refactors will give us this, e.g. #1537?).

njsmith · 2020-06-09T08:06:29Z

On further thought, going through the new idle_primed stuff is probably more convoluted than necessary. I think it would be enough to do something like this after all the timeout calculations:

if timeout == inf and not runner.external_wakeup_count and not runner.guest_mode:
    # deadlock detected

This also requires moving the timeout = min(max(0, timeout), _MAX_TIMEOUT) down below all this, and making the default timeout value inf instead of _MAX_TIMEOUT.

Ugh this will be so cool and it's so easy, I can taste it. But we need to sort out those blockers first.

njsmith added the potential API breaker label Jun 5, 2019

njsmith mentioned this issue Jun 5, 2019

Tools for finding scheduler-dependent "heisenbugs" #239

Open

This was referenced Jun 5, 2019

Provide a helper to dump the task tree / stack tree #927

Open

Should run_sync_in_worker_thread count as a blocked task for wait_all_tasks_blocked / autojump_clock? #456

Open

This was referenced Jun 7, 2019

Add test timeout support python-trio/pytest-trio#53

Open

Introduce performance measurements #604

Open

This was referenced Jun 11, 2019

[wip] closable run_sync_soon #1098

Closed

Better usability for threads #810

Open

njsmith mentioned this issue Jun 22, 2019

Proposal: make checkpoint_if_cancelled() sync-colored and rename it accordingly #961

Open

epellis mentioned this issue Jul 23, 2019

Deprecate BlockingTrioPortal in favor of from_thread.run and from_thread.run_sync #1122

Merged

njsmith mentioned this issue May 21, 2020

Idea: TreeVar (context variables that follow stack discipline) #1523

Open

smurfix mentioned this issue Jun 9, 2020

Deadlock detection #182

Open

njsmith mentioned this issue Jun 10, 2020

We need a mechanism to crash Trio with a given exception; how should it work? #1607

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Global deadlock detector #1085

Global deadlock detector #1085

njsmith commented Jun 5, 2019

njsmith commented Jun 5, 2019

njsmith commented Jun 7, 2019

njsmith commented Jun 11, 2019

njsmith commented Sep 11, 2019

njsmith commented Sep 28, 2019

smurfix commented Sep 28, 2019

njsmith commented Oct 5, 2019

njsmith commented Jun 9, 2020 •

edited

Loading

njsmith commented Jun 9, 2020

Global deadlock detector #1085

Global deadlock detector #1085

Comments

njsmith commented Jun 5, 2019

njsmith commented Jun 5, 2019

njsmith commented Jun 7, 2019

njsmith commented Jun 11, 2019

njsmith commented Sep 11, 2019

njsmith commented Sep 28, 2019

smurfix commented Sep 28, 2019

njsmith commented Oct 5, 2019

njsmith commented Jun 9, 2020 • edited Loading

njsmith commented Jun 9, 2020

njsmith commented Jun 9, 2020 •

edited

Loading