In graph — PyTorch 2.3 documentation, it says:
- capture_error_mode (str, optional) – specifies the cudaStreamCaptureMode for the graph capture stream. Can be “global”, “thread_local” or “relaxed”. During cuda graph capture, some actions, such as cudaMalloc, may be unsafe. “global” will error on actions in other threads, “thread_local” will only error for actions in the current thread, and “relaxed” will not error on actions.
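For reference, this is roughly how that argument is passed; a minimal sketch (the captured workload and the warmup are just placeholders):

```python
import torch

g = torch.cuda.CUDAGraph()
static_x = torch.zeros(1024, device="cuda")

# Warm up on a side stream before capturing, as the CUDA graphs docs recommend.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_x + 1
torch.cuda.current_stream().wait_stream(s)

# capture_error_mode selects the cudaStreamCaptureMode used while capturing.
with torch.cuda.graph(g, capture_error_mode="thread_local"):
    static_y = static_x + 1

g.replay()  # re-runs the captured work
```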
In pytorch/test/test_cuda.py at main · pytorch/pytorch · GitHub, the test test_cuda_graph_error_options uses two threads (a simplified sketch follows this list):
- the main thread: performs the CUDA graph capture, and there is memory allocation within the captured workload.
- the second thread: created while the graph capture is in progress, and there is also memory allocation in its thread function body.
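To make that structure concrete, here is a rough, simplified sketch of what I understand the test to do (this is not the actual test code; the `sketch_capture` helper, tensor sizes, and error handling are my own illustration):

```python
import threading

import torch


def sketch_capture(capture_error_mode):
    """Capture a small workload while a second thread allocates CUDA memory."""
    results = {}

    def side_thread_alloc():
        # Second thread: created while the capture is in progress; its body
        # also allocates CUDA memory (an "action" in the sense of the docs).
        try:
            results["side_tensor"] = torch.empty(2 ** 20, device="cuda")
        except RuntimeError as e:
            results["side_error"] = e

    g = torch.cuda.CUDAGraph()
    x = torch.zeros(2 ** 20, device="cuda")

    # Main thread: performs the capture; the captured workload itself
    # allocates memory (the output tensor of the add).
    with torch.cuda.graph(g, capture_error_mode=capture_error_mode):
        results["main_tensor"] = x + 1  # allocation happens during capture
        t = threading.Thread(target=side_thread_alloc)
        t.start()
        t.join()

    return results
```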
Within the test code (a rough driver covering these three cases follows the list below):
- for capture_error_mode “relaxed”: it is expected that the graph can be captured successfully, because “relaxed” will not error on actions.
- for capture_error_mode “global”: it is expected that the memory allocation in the second thread fails (and so the capture is not successful), because “global” will error on actions in other threads.
- for capture_error_mode “thread_local”: my expectation is that the main thread errors (and so the capture is not successful), because “thread_local” will only error for actions in the current thread. But the test shows that the graph is actually captured successfully. Why? Thanks.
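For reference, this is roughly how I think about the three cases, written as a hypothetical driver around the `sketch_capture` helper from the sketch above (the comments are my expectations, not what the test actually asserts):

```python
import torch  # assumes sketch_capture from the sketch above is in scope

for mode in ("relaxed", "global", "thread_local"):
    torch.cuda.synchronize()
    try:
        results = sketch_capture(mode)
        # "relaxed": expected to reach here (no error on actions).
        # "thread_local": I expected a failure, but the capture succeeds.
        print(f"{mode}: capture succeeded, "
              f"side-thread error: {results.get('side_error')}")
    except RuntimeError as e:
        # "global": expected to land here, because the other thread's
        # allocation errors and the capture is not successful.
        print(f"{mode}: capture failed: {e}")
```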