Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Celery task_postrun is called only after the span has finished in chains (due to a regression) #11479

Open
oranav opened this issue Nov 21, 2024 · 10 comments
Assignees

Comments

@oranav
Copy link
Contributor

oranav commented Nov 21, 2024

#10676 introduced a regression which causes the task_postrun signal to never be able to attach tags to a span when working with Celery canvases (chains, callbacks, chords etc).

When working with a chain for instance, the task always calls apply_async on its callback just before it finishes (before task_postrun is fired). With the changes introduced in #10676, the span is already finished at this point (and sent to the tracer), so adding new tags doesn't do anything, making the span degenerated.

This means that the following happens:

  1. task_prerun is called, creating a new span, and registering it under prerun_span
  2. The task is run (either successfully or not)
  3. Celery sees that there's a callback registered, and fires the next task in the chain
  4. Task.apply_async is invoked, but it's wrapped with _traced_apply_async_inner. Inside its finally block it sees that there's a prerun_span belonging to the worker span - therefore forcefully finishing it
  5. task_postrun is fired, and tries to attach tags to the span, but at this point it's doomed, since the span has already finished.

All in all I would recommend to revert #10676 and come up with a better implementation, or at least to remove the task_postrun part of it, since it's utterly broken.

Reproduce

tasks.py:

from celery import Celery, chain

app = Celery("tasks")


@app.task
def a():
    print("a")


@app.task
def b():
    print("b")

fire.py:

from tasks import a, b

(a.si() | b.si()).delay()

Run Celery with:

DD_TRACE_ENABLED=1 ddtrace-run celery -A tasks.app worker -l debug -P solo -c 1

Then fire the chain with:

python3 fire.py

You'll see the following debug logs in the Celery worker:

[2024-11-21 14:43:11,621: INFO/MainProcess] Task tasks.a[8d17b2e1-421d-410f-a81b-a49663fba490] received
...
prerun signal start task_id=8d17b2e1-421d-410f-a81b-a49663fba490
...
The task_postrun signal was not called, so manually closing span: name='celery.run' id=11656151137401636297 trace_id=137238467797017623115752292864617901008 parent_id=None service='celery-worker' resource='tasks.a' type='worker' start=1732192991.621946 end=None duration=None error=0 tags={'component': 'celery', 'runtime-id': '935bf9ad4d684312a4a8ab390bab4646', 'span.kind': 'consumer'} metrics={'_dd.measured': 1, 'process_id': 48412} links='' events=''
...
finishing span name='celery.run' id=11656151137401636297 trace_id=137238467797017623115752292864617901008 parent_id=None service='celery-worker' resource='tasks.a' type='worker' start=1732192991.621946 end=1732192991.624561 duration=0.002615 error=0 tags={'_dd.base_service': '', '_dd.p.dm': '-0', '_dd.p.tid': '673f2adf00000000', 'component': 'celery', 'language': 'python', 'runtime-id': '935bf9ad4d684312a4a8ab390bab4646', 'span.kind': 'consumer'} metrics={'_dd.measured': 1, '_dd.top_level': 1, '_dd.tracer_kr': 1.0, '_sampling_priority_v1': 1, 'process_id': 48412} links='' events='' (enabled:True)
...
[2024-11-21 14:43:11,625: INFO/MainProcess] Task tasks.a[8d17b2e1-421d-410f-a81b-a49663fba490] succeeded in 0.003730249998625368s: None
...
postrun signal task_id=8d17b2e1-421d-410f-a81b-a49663fba490

And you'll see that the first span doesn't have any celery.* tags, for example celery.state.

@wantsui wantsui self-assigned this Nov 21, 2024
@wantsui
Copy link
Collaborator

wantsui commented Nov 21, 2024

Thank you for the report and sorry for the behavior that the change had on your spans. The update was to fix a bug where we weren’t closing spans. I really appreciate the level of detail you provided on how to reproduce and your suggestion!

I will be taking a look to discuss with my team the best way to safely get the experience you’re looking for.

As a workaround - are you ok with pinning your version for now?

Apologies for the inconvenience!

@oranav
Copy link
Contributor Author

oranav commented Nov 26, 2024

@wantsui Yes, we have pinned the version to 2.14.1 for the time being. Thanks for the quick response!

@iherasymenko
Copy link

Hello @wantsui . I am wondering if there's any update on this. This issue also affects our organization.

@wantsui
Copy link
Collaborator

wantsui commented Dec 5, 2024

@iherasymenko - Thanks for your report! This is something I am working on!

As an update, I was able to get the same behavior when I run the code snippet in their own apps as described in this specific thread (11479), but I had some issues trying to get the dd-trace-py tests to replicate this exact behavior. This is more on how the Celery tests are set up though.

I'm reviewing your report in #11624 as part of the consideration for the fix. I'll report back once I have confidence that our tests correctly capture this scenario to avoid any issues.

I'll update both issues once I have another update.

@oranav
Copy link
Contributor Author

oranav commented Dec 16, 2024

The following is a workaround for this problem:

from celery import signals
from wrapt import wrap_function_wrapper

@signals.worker_init.connect
def workaround_ddtrace_bug(**_):
    def set_item_wrapper(wrapped, instance, args, kwargs):
        if args[0] == "prerun_span":
            return
        return wrapped(*args, **kwargs)

    wrap_function_wrapper(
        "ddtrace.internal.core",
        "set_item",
        set_item_wrapper,
    )

It neutralizes the registering of the current span under prerun_span, so _traced_apply_async_inner won't prematurely close it.

@wantsui
Copy link
Collaborator

wantsui commented Dec 18, 2024

As an update, I got the approvals I needed to get this merged. I'm also working to backport this fix to 2.18.0, 2.17.0, and 2.16.0, would any of these versions for your use case reported here, @oranav ?

@esauser
Copy link

esauser commented Dec 18, 2024

I need it on 18

@wantsui
Copy link
Collaborator

wantsui commented Dec 18, 2024

Thanks for letting me know, @esauser! I'm working on it over the next two days and reply to this issue once it's ready.

@oranav
Copy link
Contributor Author

oranav commented Dec 19, 2024

@wantsui Thanks. We're currently on 2.17.

wantsui added a commit that referenced this issue Dec 19, 2024
…chains scenario (#11498)

We've made a few changes to handle celery context recently, including:
#10676

In particular the goal of
#10676 was to handle a
scenario where a long running task may run into an exception, preventing
it from closing.

Unfortunately, this scenario did not account for cases where tasks are
chained and may not close until later.

See: #11479 and
#11624

With this PR, the sample app in
#11479 would attach the
celery specific span back to the root span.

I also need to add tests for the chains scenario.

Related to AIDM-494

## Checklist
- [x] PR author has checked that all the criteria below are met
- The PR description includes an overview of the change
- The PR description articulates the motivation for the change
- The change includes tests OR the PR description describes a testing
strategy
- The PR description notes risks associated with the change, if any
- Newly-added code is easy to change
- The change follows the [library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
- The change includes or references documentation updates if necessary
- Backport labels are set (if
[applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist
- [x] Reviewer has checked that all the criteria below are met 
- Title is accurate
- All changes are related to the pull request's stated goal
- Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes
- Testing strategy adequately addresses listed risks
- Newly-added code is easy to change
- Release note makes sense to a user of the library
- If necessary, author has acknowledged and discussed the performance
implications of this PR as reported in the benchmarks PR comment
- Backport labels are set in a manner that is consistent with the
[release branch maintenance
policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)
github-actions bot pushed a commit that referenced this issue Dec 19, 2024
…chains scenario (#11498)

We've made a few changes to handle celery context recently, including:
#10676

In particular the goal of
#10676 was to handle a
scenario where a long running task may run into an exception, preventing
it from closing.

Unfortunately, this scenario did not account for cases where tasks are
chained and may not close until later.

See: #11479 and
#11624

With this PR, the sample app in
#11479 would attach the
celery specific span back to the root span.

I also need to add tests for the chains scenario.

Related to AIDM-494

## Checklist
- [x] PR author has checked that all the criteria below are met
- The PR description includes an overview of the change
- The PR description articulates the motivation for the change
- The change includes tests OR the PR description describes a testing
strategy
- The PR description notes risks associated with the change, if any
- Newly-added code is easy to change
- The change follows the [library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
- The change includes or references documentation updates if necessary
- Backport labels are set (if
[applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist
- [x] Reviewer has checked that all the criteria below are met
- Title is accurate
- All changes are related to the pull request's stated goal
- Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes
- Testing strategy adequately addresses listed risks
- Newly-added code is easy to change
- Release note makes sense to a user of the library
- If necessary, author has acknowledged and discussed the performance
implications of this PR as reported in the benchmarks PR comment
- Backport labels are set in a manner that is consistent with the
[release branch maintenance
policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)

(cherry picked from commit e8aab65)
github-actions bot pushed a commit that referenced this issue Dec 19, 2024
…chains scenario (#11498)

We've made a few changes to handle celery context recently, including:
#10676

In particular the goal of
#10676 was to handle a
scenario where a long running task may run into an exception, preventing
it from closing.

Unfortunately, this scenario did not account for cases where tasks are
chained and may not close until later.

See: #11479 and
#11624

With this PR, the sample app in
#11479 would attach the
celery specific span back to the root span.

I also need to add tests for the chains scenario.

Related to AIDM-494

## Checklist
- [x] PR author has checked that all the criteria below are met
- The PR description includes an overview of the change
- The PR description articulates the motivation for the change
- The change includes tests OR the PR description describes a testing
strategy
- The PR description notes risks associated with the change, if any
- Newly-added code is easy to change
- The change follows the [library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
- The change includes or references documentation updates if necessary
- Backport labels are set (if
[applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist
- [x] Reviewer has checked that all the criteria below are met
- Title is accurate
- All changes are related to the pull request's stated goal
- Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes
- Testing strategy adequately addresses listed risks
- Newly-added code is easy to change
- Release note makes sense to a user of the library
- If necessary, author has acknowledged and discussed the performance
implications of this PR as reported in the benchmarks PR comment
- Backport labels are set in a manner that is consistent with the
[release branch maintenance
policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)

(cherry picked from commit e8aab65)
github-actions bot pushed a commit that referenced this issue Dec 19, 2024
…chains scenario (#11498)

We've made a few changes to handle celery context recently, including:
#10676

In particular the goal of
#10676 was to handle a
scenario where a long running task may run into an exception, preventing
it from closing.

Unfortunately, this scenario did not account for cases where tasks are
chained and may not close until later.

See: #11479 and
#11624

With this PR, the sample app in
#11479 would attach the
celery specific span back to the root span.

I also need to add tests for the chains scenario.

Related to AIDM-494

## Checklist
- [x] PR author has checked that all the criteria below are met
- The PR description includes an overview of the change
- The PR description articulates the motivation for the change
- The change includes tests OR the PR description describes a testing
strategy
- The PR description notes risks associated with the change, if any
- Newly-added code is easy to change
- The change follows the [library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
- The change includes or references documentation updates if necessary
- Backport labels are set (if
[applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist
- [x] Reviewer has checked that all the criteria below are met
- Title is accurate
- All changes are related to the pull request's stated goal
- Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes
- Testing strategy adequately addresses listed risks
- Newly-added code is easy to change
- Release note makes sense to a user of the library
- If necessary, author has acknowledged and discussed the performance
implications of this PR as reported in the benchmarks PR comment
- Backport labels are set in a manner that is consistent with the
[release branch maintenance
policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)

(cherry picked from commit e8aab65)
wantsui added a commit that referenced this issue Dec 19, 2024
…chains scenario [backport 2.18] (#11806)

Backport e8aab65 from #11498 to 2.18.

We've made a few changes to handle celery context recently, including:
#10676

In particular the goal of
#10676 was to handle a
scenario where a long running task may run into an exception, preventing
it from closing.

Unfortunately, this scenario did not account for cases where tasks are
chained and may not close until later.

See: #11479 and
#11624

With this PR, the sample app in
#11479 would attach the
celery specific span back to the root span.

I also need to add tests for the chains scenario.

Related to AIDM-494

## Checklist
- [x] PR author has checked that all the criteria below are met
- The PR description includes an overview of the change
- The PR description articulates the motivation for the change
- The change includes tests OR the PR description describes a testing
strategy
- The PR description notes risks associated with the change, if any
- Newly-added code is easy to change
- The change follows the [library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
- The change includes or references documentation updates if necessary
- Backport labels are set (if
[applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist
- [x] Reviewer has checked that all the criteria below are met 
- Title is accurate
- All changes are related to the pull request's stated goal
- Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes
- Testing strategy adequately addresses listed risks
- Newly-added code is easy to change
- Release note makes sense to a user of the library
- If necessary, author has acknowledged and discussed the performance
implications of this PR as reported in the benchmarks PR comment
- Backport labels are set in a manner that is consistent with the
[release branch maintenance
policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)

Co-authored-by: wantsui <[email protected]>
@wantsui
Copy link
Collaborator

wantsui commented Dec 20, 2024

As an update, we got 2.18.1 out with the celery update: https://github.com/DataDog/dd-trace-py/releases/tag/v2.18.1 . Can you give that a try?

Unfortunately, we can't release patches for 2.17.x or 2.16.x due to the holidays and code freezes, so that'll be out in January instead. I'll update this thread once we have those other versions out in January. Apologies for the inconvenience!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants