
thread count keeps growing using ddtrace==2.14.2 #11006

Open
daillouf opened this issue Oct 11, 2024 · 12 comments

@daillouf

Hello everyone,

We are using ddtrace in production on our Celery workers, and those seem to be creating extra threads.

[Screenshot: thread count dashboard]

We had the issue when we initially upgraded all our unpinned dependencies, so we rolled back and I tested them one by one. The Celery version is 5.3.6, we also use celery-beat, and our worker pool should be multiprocessing.

Our Python version is 3.11, by the way.

Let me know if I can give additional details or if you can't reproduce.
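In case it helps with reproducing, this is roughly how I plan to log which threads are accumulating inside a worker (untested sketch; the interval and thread name are arbitrary):

    import threading
    import time

    def log_threads(interval: float = 60.0) -> None:
        # Periodically print the live thread count and thread names so any
        # growth can be attributed to specific threads (e.g. ddtrace profiler
        # or writer threads vs. Celery pool threads).
        while True:
            names = sorted(t.name for t in threading.enumerate())
            print(f"{threading.active_count()} threads: {names}")
            time.sleep(interval)

    threading.Thread(target=log_threads, name="thread-logger", daemon=True).start()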

Regards

@sanchda
Contributor

sanchda commented Oct 11, 2024

👋 thank you for the report!

Can you confirm the ddtrace version which is causing issues?

Also, I can't quite see the top of the dashboard, but it looks like this is a view of Docker Threads--is that correct? Would you be willing to share whether there is also an increase in the total number of processes (PIDs) in the container? This might help us narrow what category of behaviors to look at.
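If it's easier than reading the dashboard, something like the following (assuming psutil is available in the container) would tell us whether it's the thread count inside one process or the number of processes that is growing:

    import psutil

    # Number of OS processes visible in the container vs. the number of
    # threads inside this worker process.
    print("processes:", len(psutil.pids()))
    print("threads in this process:", psutil.Process().num_threads())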

@inikolaev

Not sure if it's related, but after upgrading to Python 3.12 and ddtrace==2.14.2 with profiling enabled, I started to observe increased CPU utilization, and the profiler has been showing thousands of exceptions being thrown.

After I disabled profiling, the CPU consumption decreased:
[Screenshot: CPU utilization before and after disabling profiling]

Thrown exceptions during the same time frame:
[Screenshot: thrown-exceptions count]


@mabdinur added the Profiling (Continuous Profiling) label Oct 22, 2024
@micahjz

micahjz commented Dec 17, 2024

I'm also observing the same CPU utilization and thrown-exception symptoms. The thrown-exceptions symptom I'm seeing in my setup is identical: it's similarly attributed to _worker at thread.py:L#89 and the _queue.Empty exception type.

I've tried several ddtrace library versions (2.3.0, 2.8.0, 2.14.0, and 2.17.0) and none seem to have fixed it. I've also tried setting DD_PROFILING_STACK_V2_ENABLED=false per other recommendations, and experimented with DD_PROFILING_EXPORT_LIBDD_ENABLED=false as well.
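For reference, my flag experiments were wired up roughly like this (a simplified local sketch, not our actual deployment config; ddtrace picks the variables up at startup, so they have to be set before it is imported):

    import os

    # Flags under test; values here are the ones mentioned above.
    os.environ["DD_PROFILING_STACK_V2_ENABLED"] = "false"
    os.environ["DD_PROFILING_EXPORT_LIBDD_ENABLED"] = "false"

    import ddtrace.profiling.auto  # starts the profiler on import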

Looking for any recommendation or resolution, thanks!

@sanchda
Contributor

sanchda commented Dec 17, 2024

@micahjz thank you for the report. I received another report about this today; investigating.

Just to double-check, when you wrote

DD_PROFILING_STACK_V2_ENABLED=false

Did you mean

DD_PROFILING_STACK_V2_ENABLED=true?

Generally, for overhead-related issues our recommendation has been to enable the "stack v2" feature. That said, stack v2 does not yet support exception profiles, so part of the symptom will be obscured by this feature.
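For context on why the exception profile looks so noisy: a timeout-based polling loop raises queue.Empty as ordinary control flow, and the exception profiler samples those. The shape is roughly the following (an illustration, not ddtrace's actual code):

    import queue
    import threading
    import time

    def poller(q: queue.Queue, stop: threading.Event) -> None:
        # Polling with a timeout raises queue.Empty on every idle interval,
        # so an exception profiler records a steady stream of samples even
        # though nothing is actually wrong.
        while not stop.is_set():
            try:
                item = q.get(timeout=0.1)
            except queue.Empty:
                continue
            print("got", item)

    stop = threading.Event()
    t = threading.Thread(target=poller, args=(queue.Queue(), stop), daemon=True)
    t.start()
    time.sleep(1.0)
    stop.set()
    t.join()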

@micahjz

micahjz commented Dec 17, 2024

Hi @sanchda.

Ah, perhaps I misread the recommendation in ticket #9205 to address segmentation faults.

I've re-enabled it with DD_PROFILING_STACK_V2_ENABLED=true. The thrown exceptions are gone, as you indicated. The performance doesn't degrade significantly, but I still see some degradation when switching to ddtrace==2.17.0 with DD_PROFILING_STACK_V2_ENABLED=true.

The performance degradation symptoms are:

[Screenshots, top to bottom: user/system CPU time, CPU utilization percentage, context switches, p90 latency]

This is using Python 3.12.7 in a testing environment with no traffic. At the :53 minute mark, we switch to ddtrace==2.17.0 with DD_PROFILING_STACK_V2_ENABLED=true.

From the images, top to bottom:

  • higher user and system CPU usage. It's hard to see, but the system CPU time goes from <10ms to ~50ms, an almost 5x increase.
  • increase in overall CPU utilization percentage, from 2% to 3-4% (despite no traffic for this service in the testing environment)
  • increase in context switches, especially involuntary ones
  • increase in p90 latency, which is probably measuring various spans, health checks, etc. (a rough way to capture similar numbers locally is sketched below)
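The numbers above come from our Datadog dashboards; a rough local cross-check of the same counters, assuming psutil is installed, would look something like this:

    import time
    import psutil

    proc = psutil.Process()
    cpu_before, ctx_before = proc.cpu_times(), proc.num_ctx_switches()
    time.sleep(60)  # sample window; length is arbitrary
    cpu_after, ctx_after = proc.cpu_times(), proc.num_ctx_switches()

    print("user CPU s:", cpu_after.user - cpu_before.user)
    print("system CPU s:", cpu_after.system - cpu_before.system)
    print("voluntary ctx switches:", ctx_after.voluntary - ctx_before.voluntary)
    print("involuntary ctx switches:", ctx_after.involuntary - ctx_before.involuntary)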

@sanchda
Contributor

sanchda commented Dec 17, 2024

@micahjz thank you for checking it out and the comprehensive report. I think there might be something actionable here, but I need to run some of my own experiments to see what I can confirm.

Just to verify--do you know whether you're observing the issue where the total thread count is increasing when profiling is enabled?

@micahjz

micahjz commented Dec 17, 2024

I don't believe I'm seeing the total thread count symptom. Thanks for the assistance here @sanchda!

@sanchda
Contributor

sanchda commented Dec 17, 2024

@micahjz are there any details you can share about the structure of your application? Is it using threads? Tasks? One short-lived thread per request? One short-lived ConcurrentFuture worker per request? A static worker pool? A single process? Any insight will help us investigate.

To be clear, generally speaking we're pretty comfortable having a small overhead--like the kind you observe--for CPU metrics. Even an increase in involuntary context switches is something we might be able to explain and admit.

I'm very confused by the 2-3ms increase in p90 latency for an idle process. I can't really explain that. I doubt you're running a configuration where you have a CPU quota tuned very carefully to the baseline of an idle process, so it's not like the extra CPU time is time-slicing (e.g., at the level of the Linux scheduler) with the request handler. The v2 stack sampler primarily runs in a separate native thread, so a lot of this overhead does not serialize with the CPython runtime (unlike the "v1" stack sampler, which needed to take the GIL in order to perform a sample).

@micahjz

micahjz commented Dec 17, 2024

The application is a FastAPI / uvicorn web server within a single process. It uses asyncio as the event loop. Each endpoint is an async function, and every request should be handled by a short-lived coroutine. We generally don't use any multithreading. Infrequently (and not in an idle server) we use the asyncio run_in_executor functionality, which will run something in a separate thread within a ThreadPoolExecutor. Otherwise, everything should run within a single process and single thread.
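For context, the off-loading pattern looks roughly like this (simplified; the endpoint name, pool size, and blocking function are illustrative):

    import asyncio
    from concurrent.futures import ThreadPoolExecutor
    from fastapi import FastAPI

    app = FastAPI()
    _executor = ThreadPoolExecutor(max_workers=4)

    def blocking_work() -> str:
        # Placeholder for the occasional blocking call we off-load.
        return "done"

    @app.get("/offload")
    async def offload() -> dict:
        # run_in_executor moves the blocking call onto a pool thread so the
        # event loop stays responsive; everything else runs on the loop thread.
        result = await asyncio.get_running_loop().run_in_executor(_executor, blocking_work)
        return {"result": result}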

@micahjz

micahjz commented Dec 17, 2024

That said, the service page on Datadog for this service in our testing environment indicates a total of ~400 threads across the 3 running pods.

@micahjz

micahjz commented Dec 18, 2024

@sanchda I wanted to add this clarification just in case I wasn't clear. In the details and screenshots above, we were already running with profiling enabled. At the :53 minute mark, we change from ddtrace==1.8.0 to ddtrace==2.17.0 with DD_PROFILING_STACK_V2_ENABLED=true. When you say some overhead is expected, I want to make sure it's clear that we were already running with profiling enabled; we were just upgrading the version of ddtrace. I'm not surprised there's overhead from running profiling; I'm surprised there's this additional degree of overhead when upgrading from ddtrace==1.8.0 to ddtrace==2.17.0. Is this additional overhead penalty expected?

@sanchda
Contributor

sanchda commented Dec 18, 2024

I want to make sure it's clear that we were already running with profiling enabled

Appreciate the clarification--thank you--I misunderstood the original message to mean that the overhead was relative to baseline.

No, this is very much not our expectation for the stack v2 feature.
