
Conversation


@ccam80 ccam80 commented Jan 4, 2026

Description

There is a race condition in the CUDA simulator, specifically in the swapped_cuda_module context manager.

I use the simulator for quick-running CI to avoid using up precious free GPU minutes. Occasionally, I get this error:

AttributeError: tid=[0, 13, 0] ctaid=[0, 0, 0]: module 'numba.cuda' has no attribute 'local'

It is raised from a different thread each time. The error arose more commonly after I began allocating arrays in a small helper function in its own module. The error is similar to the one raised in numba/numba#1844.

Each thread in the simulator is a threading.Thread object, so they share memory. Every time a device function is called, it is wrapped in this context manager:

@contextmanager
def swapped_cuda_module(fn, fake_cuda_module):
    from numba import cuda
    fn_globs = fn.__globals__
    # get all globals that are the "cuda" module
    orig = dict((k, v) for k, v in fn_globs.items() if v is cuda)
    # build replacement dict
    repl = dict((k, fake_cuda_module) for k, v in orig.items())
    # replace
    fn_globs.update(repl)
    try:
        yield
    finally:
        # revert
        fn_globs.update(orig)
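
Conceptually, each simulated thread that invokes a device function does something like the following (schematic only, not the simulator's actual call path; `fn`, `args` and `fake_cuda_module` stand in for the simulator's internals):

# Schematic only -- each simulated thread wraps the call in the context manager,
# so concurrent threads can enter and exit the swap at different times.
with swapped_cuda_module(fn, fake_cuda_module):
    fn(*args)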

Race:

Thread A and Thread B are executing device functions in the same Python module. They don't need to be the same function, but they must be in a separate file from the kernel definition, as the kernel replaces references on entry, runs all threads, and restores the references only after all threads have exited.

  1. Thread A launches and swaps numba.cuda for fake_cuda, yields.
  2. Thread B launches and gets orig = {} and repl = {}, as no references to cuda remain in the shared __globals__ dict (Thread A has already replaced them with fake_cuda). Thread B yields.
  3. Thread A exits, replacing fake_cuda with numba.cuda.
  4. Thread B calls e.g. cuda.local.array and sees the restored reference to numba.cuda. local is not imported as part of numba.cuda when NUMBA_ENABLE_CUDASIM==1, so the AttributeError is raised.

MWE

The Gist below contains a script that reliably reproduces the error. It typically takes ~200 s to hit the race on my machine, so I have not added it to the test suite. It does seem to fail faster under pytest-xdist, but it has a very long runtime when it doesn't fail.

Reproducer
Place all three files in the same directory, and run cudasim_race_mwe.
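
The Gist itself isn't reproduced here, but as a rough illustration only (hypothetical file and function names, not the actual Gist contents) the layout is: a device function using cuda.local.array in one module, a kernel calling it from a second module, and a driver that launches the kernel repeatedly with NUMBA_ENABLE_CUDASIM=1 set until the race is hit.

# Illustrative sketch only -- hypothetical names, not the Gist contents.
# helper_module.py: device function that touches cuda.local on every call
from numba import cuda, float64
import numpy as np

@cuda.jit(device=True)
def fill_from_local(out, i):
    tmp = cuda.local.array(4, float64)  # the attribute lookup that races
    for j in range(4):
        tmp[j] = i + j
    out[i] = tmp[0]

# kernel_module.py: kernel defined in a different module from the helper
@cuda.jit
def kernel(out):
    i = cuda.grid(1)
    if i < out.size:
        fill_from_local(out, i)

# cudasim_race_mwe.py: run with NUMBA_ENABLE_CUDASIM=1 in the environment,
# launching repeatedly until the AttributeError appears
if __name__ == "__main__":
    out = np.zeros(256)
    for _ in range(10_000):
        kernel[4, 64](out)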

Fix

This PR implements a per-module lock and reference count, so that the first thread to enter the context for a module replaces cuda -> fake_cuda, and the last thread to exit restores fake_cuda -> cuda. There may be a performance hit for simulated kernels with many device function calls across many modules, but it should be small: every thread except the first to enter and the last to exit performs only an integer comparison and an increment/decrement under the lock. The short "benchmark" run in the MWE did not change duration between the patched and unpatched versions on my machine.

@contextmanager
def swapped_cuda_module(fn, fake_cuda_module):
    from numba import cuda
    fn_globs = fn.__globals__
    gid = id(fn_globs)

    # Use a per-module lock to avoid cross-locking other modules
    lock = _globals_locks[gid]

    with lock:
        # Scan and replace globals with fake module on first entrance only
        if _swap_refcount[gid] == 0:
            orig = {k: v for k, v in fn_globs.items() if v is cuda}
            _swap_orig[gid] = orig
            for k in orig:
                fn_globs[k] = fake_cuda_module

        # Increment the reference counter on every entrance
        _swap_refcount[gid] += 1
    try:
        yield
    finally:
        with lock:
            # Decrement the "threads in this module using fake CUDA" counter on exit
            _swap_refcount[gid] -= 1

            # Last thread to leave the context restores real cuda
            if _swap_refcount[gid] == 0:
                fn_globs.update(_swap_orig.pop(gid))
                del _swap_refcount[gid]
                del _globals_locks[gid]

ccam80 added 2 commits January 4, 2026 20:38
The prior fix scanned each module's entire globals dict under lock on every run, and all modules shared a single lock. This update only scans the globals dict on the first entry for a module. Additionally, each module has its own lock, so a thread holding the lock in one module doesn't delay the launch of a thread for a function in another module.
Copilot AI review requested due to automatic review settings January 4, 2026 23:38

copy-pr-bot bot commented Jan 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


greptile-apps bot commented Jan 4, 2026

Greptile Summary

This PR fixes a critical race condition in the CUDA simulator's swapped_cuda_module context manager. The original implementation had a TOCTOU vulnerability where concurrent threads could interfere with each other's module swapping, causing AttributeError when accessing cuda.local and other attributes.

Key changes:

  • Added per-module locking using defaultdict(threading.Lock) indexed by module globals ID
  • Implemented reference counting to track active threads per module
  • Only the first thread entering swaps cuda -> fake_cuda; only the last thread exiting restores it
  • Protected lock creation with _locks_register_lock to prevent concurrent lock creation race
  • Intentionally keeps per-module locks alive for the session to avoid deletion race

The fix correctly handles the race scenario described in the PR where Thread A and B execute device functions simultaneously. The reference counting ensures atomicity of module swapping operations while allowing concurrent execution within the same module.

Confidence Score: 5/5

  • This PR is safe to merge - it fixes a critical race condition with a well-designed thread-safe solution
  • The implementation demonstrates careful evolution through multiple commits addressing edge cases. The per-module locking strategy with reference counting is a textbook solution for this type of TOCTOU problem. Lock hierarchy prevents deadlocks, and the decision to keep locks alive avoids deletion races. No logical errors or security issues found.
  • No files require special attention

Important Files Changed

Filename: numba_cuda/numba/cuda/simulator/kernelapi.py
Overview: Fixed race condition in swapped_cuda_module by adding per-module locks and reference counting; the implementation is thread-safe and correct.


Copilot AI left a comment


Pull request overview

This PR fixes a race condition in the CUDA simulator's swapped_cuda_module context manager that occurs when multiple simulated threads simultaneously call device functions from the same Python module. The fix implements per-module locks and reference counting to ensure thread-safe swapping and restoration of module globals.

Key Changes:

  • Added per-module locking mechanism using defaultdict(threading.Lock) to synchronize access to module globals
  • Implemented reference counting to track the number of active threads using the fake CUDA module in each Python module
  • Modified the swap logic so only the first entering thread performs the swap, and only the last exiting thread performs the restoration



@greptile-apps greptile-apps bot left a comment


Additional Comments (1)

  1. numba_cuda/numba/cuda/simulator/kernelapi.py, lines 493-504

    logic: race condition: defaultdict access is not thread-safe when the key doesn't exist.

    When multiple threads call device functions from the same module for the first time, they race at line 504: each thread can create a different lock object for the same gid, defeating the per-module locking.

1 file reviewed, 1 comment


ccam80 added 3 commits January 6, 2026 12:04
The previous commit introduced a global lock around creation of the per-module lock. This prevented concurrent lock creation, where one thread could end up holding a different lock object from the others and modify `fn_globs` or `_swap_refcount` in a race with other threads. Implementing this exposed yet another race: a thread could delete the lock from `_globals_locks` while another thread was already waiting at the entrance to the first `with lock:` statement. There is no need to delete a module's lock during runtime, so this commit simply removes the `del _globals_locks[gid]` statement.
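
A minimal sketch of the shape of that guard (the name _locks_register_lock follows the review summary above, and _get_module_lock is a hypothetical helper used only for illustration; the actual code in kernelapi.py may differ):

import threading

_locks_register_lock = threading.Lock()
_globals_locks = {}


def _get_module_lock(gid):
    # Hypothetical helper: all per-module locks are created under one
    # registration lock, so two threads entering the same module for the
    # first time cannot end up holding different lock objects for the same
    # gid. Locks are kept for the whole session; nothing deletes them,
    # which avoids the deletion race described above.
    with _locks_register_lock:
        lock = _globals_locks.get(gid)
        if lock is None:
            lock = threading.Lock()
            _globals_locks[gid] = lock
    return lock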