[BUG]: RuntimeError: dictionary changed size during iteration in _get_metas_to_propagate #16523

@hutchiko

Description

Tracer Version(s)

datadog-lambda==8.122.0 ddtrace==4.3.1

Python Version(s)

AWS Lambda Python 3.12 runtime

Pip Version(s)

AWS Lambda Python 3.12 runtime

Bug Report

We started seeing intermittent exceptions thrown from ddtrace internals.

e.g.

Traceback (most recent call last):
  File "/var/task/processor.py", line 181, in process
    enrichment, target_topic, api_sink_future = self.process_entity_update(entity)
                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/task/processor.py", line 116, in process_entity_update
    maybe_enrichment = self.enricher.enrich(entity, attempt=1)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/task/enricher/__init__.py", line 566, in enrich
    with tracer.trace("enrich") as span:
         ^^^^^^^^^^^^^^^^^^^^^^
  File "/var/task/ddtrace/_trace/tracer.py", line 648, in trace
    return self.start_span(
           ^^^^^^^^^^^^^^^^
  File "/var/task/ddtrace/_trace/tracer.py", line 517, in _start_span
    for k, v in _get_metas_to_propagate(context):
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/task/ddtrace/internal/utils/__init__.py", line 79, in _get_metas_to_propagate
    return [(k, v) for k, v in context._meta.items() if isinstance(k, str) and k.startswith("_dd.p.")]
                               ^^^^^^^^^^^^^^^^^^^^^
RuntimeError: dictionary changed size during iteration

and

RuntimeError: dictionary changed size during iteration
  File "/var/task/datadog_lambda/wrapper.py", line 188, in __call__
    self.response = self.func(event, context, **kwargs)
  File "/var/task/aws/lambdas/lambda_helpers.py", line 190, in __call__
    logger.info(f"Function Instance: {LAMBDA_INSTANCE_ID}")
  File "/var/task/ddtrace/contrib/internal/aws_lambda/patch.py", line 124, in __call__
    self.response = self.func(*args, **kwargs)
  File "/var/task/aws/lambdas/lambda_helpers.py", line 224, in __call__
    process_results = [(record, future.result()) for record, future in futures]
  File "/var/lang/lib/python3.12/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
  File "/var/lang/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/var/lang/lib/python3.12/concurrent/futures/thread.py", line 59, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/var/task/ddtrace/contrib/internal/futures/threading.py", line 43, in _wrap_execution
    return fn(*args, **kwargs)
  File "/var/task/aws/lambdas/lambda_helpers.py", line 208, in invoke_processor
    with tracer.trace("process"):
  File "/var/task/ddtrace/_trace/tracer.py", line 648, in trace
    return self.start_span(
  File "/var/task/ddtrace/_trace/tracer.py", line 517, in _start_span
    for k, v in _get_metas_to_propagate(context):
  File "/var/task/ddtrace/internal/utils/__init__.py", line 79, in _get_metas_to_propagate
    return [(k, v) for k, v in context._meta.items() if isinstance(k, str) and k.startswith("_dd.p.")]

This is a multi-threaded application. The first exception above is thrown in a child thread; the second appears to be raised in the primary Lambda runtime thread.

This issue only appeared once we started using the following pattern to force consistent sampling of traces across threads:

from ddtrace import tracer

def ensure_trace_sampling() -> None:
    # Make the sampling decision on the root span up front so that all
    # threads propagate a consistent decision.
    root_span = tracer.current_root_span()
    if root_span is not None:
        tracer.sample(root_span)

Every time we extract tracer.current_trace_context(), we now first call ensure_trace_sampling().

e.g.

ensure_trace_sampling()
trace_context = capture_trace_context()
...
thread_pool_executor.submit(some_worker, trace_context, ...),
...

Since we added this call to ensure_trace_sampling(), we have started to see the above exceptions. Note that we added this call based on the documentation at https://ddtrace.readthedocs.io/en/stable/advanced_usage.html#tracing-across-threads.
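For reference, the child-thread side of this pattern follows the linked documentation and looks roughly like the sketch below (some_worker is an illustrative stand-in for the real workers):

from ddtrace import tracer

def some_worker(trace_context, *args):
    # Re-activate the captured parent context in this thread so that new
    # spans are parented to the original trace (per the linked docs).
    tracer.context_provider.activate(trace_context)
    with tracer.trace("process"):
        ...  # actual work elided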

Some of our child threads also spawn new child threads, so it is possible for the above pattern to run multiple times for a single root span across multiple threads. I suspect that tracer.sample(root_span) is mutating the state that _get_metas_to_propagate is iterating over, which leads to the concurrent iteration and mutation.
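To make that concrete, here is a rough, hypothetical sketch of the suspected interleaving (not a verified reproduction): one thread re-runs the sampling decision on the shared root span while another thread starts spans from the captured context.

import threading

from ddtrace import tracer

# Hypothetical sketch only: tracer.sample() may write sampling tags such as
# "_dd.p.dm" into the shared context's _meta dict, while start_span() in
# another thread iterates that same dict inside _get_metas_to_propagate().
with tracer.trace("root") as root_span:
    ctx = tracer.current_trace_context()

    def resample_loop():
        # Repeatedly re-run the sampling decision on the shared root span.
        for _ in range(10_000):
            tracer.sample(root_span)

    def span_loop():
        # Start spans off the captured context, mirroring the worker pattern above.
        tracer.context_provider.activate(ctx)
        for _ in range(10_000):
            with tracer.trace("child"):
                pass

    threads = [threading.Thread(target=resample_loop), threading.Thread(target=span_loop)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()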

Anecdotally, the more worker threads are in play, the more likely this issue is to arise. We run several versions of this service processing different payload shapes, and we only see these exceptions in the services that process complex shapes requiring a larger number of threads.

Reproduction Code

No response

Error Logs

No response

Libraries in Use

No response

Operating System

No response
