Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WaitHandleWait profiler #5814

Open
verdie-g opened this issue Jul 26, 2024 · 8 comments
Open

WaitHandleWait profiler #5814

verdie-g opened this issue Jul 26, 2024 · 8 comments
Assignees
Labels
area:profiler Issues related to the continous-profiler

Comments

@verdie-g
Copy link
Contributor

Is your feature request related to a problem? Please describe.

Thread pool starvation is a very common performance issue in the services of my company. They are usually caused by sync-over-async. It can be relatively easy to find them when it's a continuous issue, but it gets tricky when it occurs the callback of a 15 minutes timer.

Describe the solution you'd like

In .NET 9+, I added an event (dotnet/runtime#94737) that is emitted whenever a managed thread is blocked (Monitor.Wait, Task.Wait, MRES.Wait, SemaphoreSlim.Wait, ...). It would be awesome if this event was collected just like the exceptions, allocations, contention profilers.

One annoying this about this event is that most occurrences will usually happen in non-thread-pool threads that spend most of their time blocking. For that, I opened dotnet/runtime#102326.

Describe alternatives you've considered

Currently, I took super long traces of the WaitHandleWait event and try to find the issue in PerfView but it's not reliable.

@gleocadie
Copy link
Collaborator

Hi @verdie-g

thanks for the FR, that's really interesting. Thread starvation is a kind of things we are trying to detect and report in our UI.
Currently, we push a metric on the number of managed threads which can help to detect a threadpool starvation (continuously increasing number of threads) but nothing more (no callstacks, no additional info).
We want to add more info and thanks to your PR we could add another level of information (thanks for that :) ).

We can collect this event and it will be done for all managed threads. As of today, the filter is done in the UI (using the name/id...).

Just to be sure, and make this useful for everyone:
As a user, how would see that information usefully displayed to you (and gives you actionable ways to investigate/fix it) ? (callstacks, metrics, monitors...)

We'll let you know about the status of this FR.
Thanks for the proposition.

@gleocadie gleocadie added the area:profiler Issues related to the continous-profiler label Jul 31, 2024
@verdie-g
Copy link
Contributor Author

Just to be sure, and make this useful for everyone:
As a user, how would see that information usefully displayed to you

I would like to see a flamegraph of the waits just like the contention.

Ideally, the unit would but the wait time. The number of events can be extremely misleading. Unlike the contention event, it was decided not to include it in the wait event. In my custom perfview, I use the thread id of the event to match the start and the stop.

As a metric, the total wait time could also be interesting. For example, when a spike of latency is observed between two services but the calling service has a spike of wait time, it could help understand that the issue is not in the called service.

@gleocadie
Copy link
Collaborator

Just to be sure, and make this useful for everyone:
As a user, how would see that information usefully displayed to you

I would like to see a flamegraph of the waits just like the contention.

Ideally, the unit would but the wait time. The number of events can be extremely misleading. Unlike the contention event, it was decided not to include it in the wait event. In my custom perfview, I use the thread id of the event to match the start and the stop.

As a metric, the total wait time could also be interesting. For example, when a spike of latency is observed between two services but the calling service has a spike of wait time, it could help understand that the issue is not in the called service.

Thanks, I'll create a FR on our backlog and we'll have to plan it.

@gleocadie gleocadie self-assigned this Aug 1, 2024
@chrisnas
Copy link
Contributor

chrisnas commented Dec 10, 2024

@verdie-g : I'm starting to work on the possible integration of these new events into our contention profiler.

Is it expected that "noisy" very short waits happen?
Image
For example, I'm computing the wait duration based on the Stop - Start timestamps and I can see these AFTER real wait

@verdie-g
Copy link
Contributor Author

Is it expected that "noisy" very short waits happen?

Do you have the stack traces for those? There are many code paths, internal to the runtime, that could lead to a wait event to be emitted. See this dotnet/runtime#94737 (comment). Though, in my investigations I've never seen them, or maybe they were filtered out by my tooling.

@chrisnas
Copy link
Contributor

Is it expected that "noisy" very short waits happen?

Do you have the stack traces for those? There are many code paths, internal to the runtime, that could lead to a wait event to be emitted. See this dotnet/runtime#94737 (comment). Though, in my investigations I've never seen them, or maybe they were filtered out by my tooling.

Here is the source code of the test application:

using System.Diagnostics;
using System.Runtime.InteropServices;

class Program
{
    static AutoResetEvent autoResetEvent = new AutoResetEvent(false);
    static ManualResetEvent manualResetEvent = new ManualResetEvent(false);
    static Mutex mutex = new Mutex();
    static Semaphore semaphore = new Semaphore(1, 1);

    static void Main()
    {
        Console.WriteLine($"pid = {Process.GetCurrentProcess().Id}");
        Console.WriteLine("Press Enter to start the threads...");
        Console.ReadLine();

        // Create and start threads for different synchronization scenarios
        Console.WriteLine();
        Console.WriteLine("Starting threads...");

        Stopwatch sw = new Stopwatch();
        sw.Start();

        // start a thread that will own the mutex and the semaphore
        var owningThread = new Thread(OwningThread);
        owningThread.Start();
        Thread.Sleep(5);

        var autoResetEventThread = new Thread(AutoResetEventThread);
        autoResetEventThread.Start();

        var manualResetEventThread = new Thread(ManualResetEventThread);
        manualResetEventThread.Start();

        var mutexThread = new Thread(MutexThread);
        mutexThread.Start();

        var semaphoreThread = new Thread(SemaphoreThread);
        semaphoreThread.Start();

        // wait for all threads to finish
        owningThread.Join();
        autoResetEventThread.Join();
        manualResetEventThread.Join();
        mutexThread.Join();
        semaphoreThread.Join();

        sw.Stop();
        Console.WriteLine("___________________________________________");
        Console.WriteLine($"Duration = {sw.ElapsedMilliseconds} ms");

        Thread.Sleep(2);
    }

    static void OwningThread()
    {
        Console.WriteLine($"    [{GetCurrentThreadId(), 8}] Start to hold resources");
        Console.WriteLine("___________________________________________");
        mutex.WaitOne();
        semaphore.WaitOne();

        Thread.Sleep(6000);
        Console.WriteLine();
        Console.WriteLine("    Release resources");

        mutex.ReleaseMutex();
        semaphore.Release(1);
        autoResetEvent.Set();
        manualResetEvent.Set();
    }

    static void AutoResetEventThread()
    {
        Console.WriteLine($"    [{GetCurrentThreadId(), 8}] waiting for AutoResetEvent...");
        autoResetEvent.WaitOne();
        Console.WriteLine("    <-- AutoResetEvent");
    }

    static void ManualResetEventThread()
    {
        Console.WriteLine($"    [{GetCurrentThreadId(), 8}] waiting for ManualResetEvent...");
        manualResetEvent.WaitOne();
        Console.WriteLine("    <-- ManualResetEvent");
    }

    static void MutexThread()
    {
        Console.WriteLine($"    [{GetCurrentThreadId(), 8}] waiting for Mutex...");
        mutex.WaitOne();
        Console.WriteLine("    <-- Mutex");
    }

    static void SemaphoreThread()
    {
        Console.WriteLine($"    [{GetCurrentThreadId(), 8}] waiting for Semaphore");
        semaphore.WaitOne();
        Console.WriteLine("    <-- Semaphore");
    }

    [DllImport("kernel32.dll")]
    private static extern uint GetCurrentThreadId();
}

@verdie-g
Copy link
Contributor Author

I removed the Console.Read and ran

dotnet trace collect --clrevents waithandle --clreventlevel verbose -- Test.exe

in the generated nettrace I have 5 events lasting ~6 seconds, which seems to be expected.

Image

@chrisnas
Copy link
Contributor

Interesting: they appear only with the call to Console.ReadLine...
Thanks for the test @verdie-g :^)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:profiler Issues related to the continous-profiler
Projects
None yet
Development

No branches or pull requests

3 participants