How to understand the “thread_local” capture_error_mode in CUDA graph capture

In the torch.cuda.graph documentation (graph — PyTorch 2.3 documentation), it says:

  • capture_error_mode (str, optional) – specifies the cudaStreamCaptureMode for the graph capture stream. Can be “global”, “thread_local” or “relaxed”. During cuda graph capture, some actions, such as cudaMalloc, may be unsafe. “global” will error on actions in other threads, “thread_local” will only error for actions in the current thread, and “relaxed” will not error on actions.
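
For reference, this parameter is passed to the torch.cuda.graph context manager. Below is a minimal usage sketch; the tensor names/sizes and the side-stream warm-up are my own, following the pattern in the docs:

```python
import torch

g = torch.cuda.CUDAGraph()
x = torch.zeros(8, device="cuda")

# Warm up the workload on a side stream before capturing (recommended in the docs).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    x * 2
torch.cuda.current_stream().wait_stream(s)

# capture_error_mode selects the underlying cudaStreamCaptureMode:
# "global" (default), "thread_local", or "relaxed".
with torch.cuda.graph(g, capture_error_mode="thread_local"):
    y = x * 2

g.replay()  # y now holds the result of the captured workload
```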

In pytorch/test/test_cuda.py at main · pytorch/pytorch · GitHub, the test test_cuda_graph_error_options involves two threads (roughly sketched below):
the main thread: performs the CUDA graph capture, and there is a memory allocation within the captured workload.
the second thread: created during graph capture, and there is also a memory allocation in its thread function body.
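
My (assumption-laden) paraphrase of that structure is below; the actual test uses its own allocation helpers and sizes, so the names here are mine:

```python
import threading
import torch

def alloc_in_other_thread():
    # Allocation attempted from a second thread while the main thread is capturing.
    # Whether this actually reaches cudaMalloc depends on the caching allocator's state.
    torch.empty(2 ** 20, device="cuda")

g = torch.cuda.CUDAGraph()
x = torch.zeros(2000, device="cuda")

# Warm up the workload on a side stream before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    x + x + x
torch.cuda.current_stream().wait_stream(s)

# Swap in "global", "thread_local", or "relaxed" here.
with torch.cuda.graph(g, capture_error_mode="relaxed"):
    y = x + x + x                                   # allocation in the capturing (main) thread
    t = threading.Thread(target=alloc_in_other_thread)
    t.start()                                       # second thread created during the capture
    t.join()
```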

Within the test code,

  1. for capture_error_mode “relaxed”:
    it is expected that we are able to capture the graph successfully, because “relaxed” will not error on actions.

  2. for capture_error_mode “global”:
    it is expected that we are unable to allocate the memory in the second thread (and so the capture is not successful), because “global” will error on actions in other threads.

  3. for capture_error_mode “thread_local”:
    my expectation is that the allocation in the main (capturing) thread errors (and so the capture is not successful), because “thread_local” will only error for actions in the current thread. But the test shows that the capture actually succeeds. Why is that? Thanks. A small experiment to compare the three modes is sketched below.
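
For completeness, this is the kind of experiment I would run to observe each mode. It is my own sketch (not the test itself), and each mode runs in a separate subprocess as a precaution, since a failed capture can leave the CUDA context in a bad state:

```python
import subprocess
import sys

# Probe script: capture a tiny workload while a second thread allocates GPU memory,
# then report whether the capture and the second thread's allocation succeeded.
probe = """
import sys, threading, torch

mode = sys.argv[1]
g = torch.cuda.CUDAGraph()
x = torch.zeros(8, device="cuda")
thread_errors = []

def alloc_in_other_thread():
    try:
        torch.empty(2 ** 20, device="cuda")   # may or may not hit cudaMalloc
    except RuntimeError as e:
        thread_errors.append(e)

# Warm up on a side stream before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    x + 1
torch.cuda.current_stream().wait_stream(s)

try:
    with torch.cuda.graph(g, capture_error_mode=mode):
        y = x + 1                              # allocation in the capturing thread
        t = threading.Thread(target=alloc_in_other_thread)
        t.start()
        t.join()
    print(f"{mode}: capture succeeded; error in second thread: {bool(thread_errors)}")
except RuntimeError as e:
    print(f"{mode}: capture raised: {e}")
"""

for mode in ("relaxed", "global", "thread_local"):
    subprocess.run([sys.executable, "-c", probe, mode], check=False)
```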