Description
Tracer Version(s)
datadog-lambda==8.122.0 ddtrace==4.3.1
Python Version(s)
AWS Lambda Python 3.12 runtime
Pip Version(s)
AWS Lambda Python 3.12 runtime
Bug Report
We started seeing intermittent exceptions thrown from dd-trace internals, e.g.:
Traceback (most recent call last):
File "/var/task/processor.py", line 181, in process
enrichment, target_topic, api_sink_future = self.process_entity_update(entity)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/var/task/processor.py", line 116, in process_entity_update
maybe_enrichment = self.enricher.enrich(entity, attempt=1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/var/task/enricher/__init__.py", line 566, in enrich
with tracer.trace("enrich") as span:
^^^^^^^^^^^^^^^^^^^^^^
File "/var/task/ddtrace/_trace/tracer.py", line 648, in trace
return self.start_span(
^^^^^^^^^^^^^^^^
File "/var/task/ddtrace/_trace/tracer.py", line 517, in _start_span
for k, v in _get_metas_to_propagate(context):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/var/task/ddtrace/internal/utils/__init__.py", line 79, in _get_metas_to_propagate
return [(k, v) for k, v in context._meta.items() if isinstance(k, str) and k.startswith("_dd.p.")]
^^^^^^^^^^^^^^^^^^^^^
RuntimeError: dictionary changed size during iteration
and
RuntimeError: dictionary changed size during iteration
File "/var/task/datadog_lambda/wrapper.py", line 188, in __call__
self.response = self.func(event, context, **kwargs)
File "/var/task/aws/lambdas/lambda_helpers.py", line 190, in __call__
logger.info(f"Function Instance: {LAMBDA_INSTANCE_ID}")
File "/var/task/ddtrace/contrib/internal/aws_lambda/patch.py", line 124, in __call__
self.response = self.func(*args, **kwargs)
File "/var/task/aws/lambdas/lambda_helpers.py", line 224, in __call__
process_results = [(record, future.result()) for record, future in futures]
File "/var/lang/lib/python3.12/concurrent/futures/_base.py", line 456, in result
return self.__get_result()
File "/var/lang/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/var/lang/lib/python3.12/concurrent/futures/thread.py", line 59, in run
result = self.fn(*self.args, **self.kwargs)
File "/var/task/ddtrace/contrib/internal/futures/threading.py", line 43, in _wrap_execution
return fn(*args, **kwargs)
File "/var/task/aws/lambdas/lambda_helpers.py", line 208, in invoke_processor
with tracer.trace("process"):
File "/var/task/ddtrace/_trace/tracer.py", line 648, in trace
return self.start_span(
File "/var/task/ddtrace/_trace/tracer.py", line 517, in _start_span
for k, v in _get_metas_to_propagate(context):
File "/var/task/ddtrace/internal/utils/__init__.py", line 79, in _get_metas_to_propagate
return [(k, v) for k, v in context._meta.items() if isinstance(k, str) and k.startswith("_dd.p.")]
This is a multi-threaded application. The first exception above is thrown in a child thread; the second appears to be thrown in the primary Lambda runtime thread.
This issue only started appearing after we adopted the following pattern to force a consistent sampling decision for traces across threads:
def ensure_trace_sampling() -> None:
    root_span = tracer.current_root_span()
    if root_span is not None:
        tracer.sample(root_span)
Every time we capture tracer.current_trace_context() to hand off to a worker thread, we now call ensure_trace_sampling() first, e.g.:
ensure_trace_sampling()
trace_context = capture_trace_context()
...
thread_pool_executor.submit(some_worker, trace_context, ...),
...
Since adding this call to ensure_trace_sampling() we've started to see the above exceptions. Note that we added it based on the documentation at https://ddtrace.readthedocs.io/en/stable/advanced_usage.html#tracing-across-threads.
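For context, the overall pattern looks roughly like this (capture_trace_context() and some_worker are simplified sketches of our helpers, not the real code; the worker re-activates the context as the linked docs describe, and the import path follows the current docs):

from concurrent.futures import ThreadPoolExecutor

from ddtrace.trace import tracer  # `from ddtrace import tracer` on older releases

def ensure_trace_sampling() -> None:
    root_span = tracer.current_root_span()
    if root_span is not None:
        tracer.sample(root_span)

def capture_trace_context():
    # Force a sampling decision on the root span, then snapshot the context.
    ensure_trace_sampling()
    return tracer.current_trace_context()

def some_worker(trace_context, record):
    # Re-activate the parent trace context in the worker thread so child
    # spans join the same trace (per the tracing-across-threads docs).
    tracer.context_provider.activate(trace_context)
    with tracer.trace("process"):
        return record  # placeholder for the real processing

with tracer.trace("handler"):
    with ThreadPoolExecutor(max_workers=4) as thread_pool_executor:
        futures = [
            thread_pool_executor.submit(some_worker, capture_trace_context(), record)
            for record in range(4)
        ]
        results = [future.result() for future in futures]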
Some of our child threads also spawn new child threads, so the above pattern can happen multiple times for a single root span across multiple threads. I suspect that tracer.sample(root_span) is mutating the state that _get_metas_to_propagate is iterating over, which leads to the concurrent iteration and mutation.
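If that suspicion is right, a stress test along these lines might surface the race. This is hypothetical and untested: it just re-runs the sampler on a shared root span while other threads start child spans on the same context, and the sampler may only rewrite the _dd.p.* tags under some conditions, so it may not reproduce reliably.

import threading

from ddtrace.trace import tracer  # `from ddtrace import tracer` on older releases

def resampler(root_span, stop):
    # What ensure_trace_sampling() does: re-run the sampler on the shared
    # root span, which can write _dd.p.* keys into root_span.context._meta.
    while not stop.is_set():
        tracer.sample(root_span)

def span_starter(trace_context, stop):
    # Start child spans against the same context; start_span iterates
    # context._meta via _get_metas_to_propagate().
    tracer.context_provider.activate(trace_context)
    while not stop.is_set():
        with tracer.trace("child"):
            pass

def main():
    stop = threading.Event()
    with tracer.trace("root") as root:
        ctx = tracer.current_trace_context()
        threads = [threading.Thread(target=resampler, args=(root, stop))]
        threads += [
            threading.Thread(target=span_starter, args=(ctx, stop))
            for _ in range(8)
        ]
        for t in threads:
            t.start()
        stop.wait(timeout=2)  # let the threads race for a couple of seconds
        stop.set()
        for t in threads:
            t.join()

if __name__ == "__main__":
    main()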
Anecdotally, the more worker threads are in play, the more likely this issue is to arise. We have several versions of this service processing different payload shapes, and we only see these exceptions in the services that process complex shapes requiring a larger number of threads.
Reproduction Code
No response
Error Logs
No response
Libraries in Use
No response
Operating System
No response