How to make the CPU block instead of spin on cudaMemcpyAsync/cudaStreamSynchronize

I am studying a PyTorch model running on a heterogeneous CPU/GPU system. Whenever there is dynamic control flow (i.e., an if statement whose predicate depends on the value of a tensor on the GPU), the CPU synchronizes with the GPU and waits for that value to be computed. This happens when the GPU is the bottleneck: the CPU runs ahead of the GPU, and the instruction pointer on the CPU reaches the predicate of the if statement before the kernel that computes the tensor the predicate depends on has finished on the GPU. Upon profiling, I can see that the CPU spins for the entire duration of the synchronization. CUDA allows the CPU to block during synchronization instead, via the cudaSetDeviceFlags API. Also, some issues[1] on the NVIDIA developer forums report that this behaviour is present on Linux machines. Since I use Linux, I can confirm this in my case. I have also asked a related question on the NVIDIA developer forums[link].
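A minimal sketch of the pattern described above, in case it helps to reproduce it. The function name branch_on_gpu_value and the tensor sizes are made up for illustration; the code falls back to the CPU when no GPU is available, in which case no synchronization is involved.

```python
import torch

# Use the GPU when present; otherwise run on the CPU (no sync to observe then).
device = "cuda" if torch.cuda.is_available() else "cpu"


def branch_on_gpu_value(x):
    # On a CUDA device these kernels are launched asynchronously,
    # so the CPU can return from these calls before the GPU is done.
    s = (x @ x).sum()
    # Evaluating the predicate needs the value of s on the host, so the CPU
    # has to wait here until the GPU has finished computing it.
    if s > 0:
        return "positive"
    return "non-positive"


if __name__ == "__main__":
    x = torch.ones(256, 256, device=device)
    print(branch_on_gpu_value(x))
```

Profiling the CPU thread around the if statement is where the wait shows up when the GPU is the bottleneck.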


  • Does the PyTorch API allow us to use the cudaSetDeviceFlags API? If yes, please point me to the steps, as I cannot find the relevant documentation. If no, are there problems with the CPU blocking instead of spinning during synchronization?

Please feel free to point out anything I missed. Thanks.

I don’t think it’s currently possible to change this flag in PyTorch; an older feature request for it was discussed here. However, I don’t know its status.
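One possible workaround, outside the PyTorch API, would be to call cudaSetDeviceFlags on the CUDA runtime directly through ctypes. The sketch below is untested against PyTorch and rests on several assumptions: that a libcudart can be found via ctypes.util.find_library, that setting the flag through it also affects the context PyTorch uses, and that the call happens before PyTorch creates its CUDA context. The helper name request_blocking_sync is hypothetical; the value 0x04 is cudaDeviceScheduleBlockingSync from the CUDA runtime headers.

```python
import ctypes
import ctypes.util

# Value of cudaDeviceScheduleBlockingSync in the CUDA runtime headers.
CUDA_DEVICE_SCHEDULE_BLOCKING_SYNC = 0x04


def request_blocking_sync():
    """Ask the CUDA runtime to block, rather than spin, while synchronizing.

    Returns the cudaError_t code as an int (0 means cudaSuccess), or None
    if no CUDA runtime library could be located or loaded.
    """
    # Assumption: a libcudart is discoverable on the system library path.
    # PyTorch may ship its own copy, which this lookup will not find.
    name = ctypes.util.find_library("cudart")
    if name is None:
        return None
    try:
        cudart = ctypes.CDLL(name)
    except OSError:
        return None
    # C signature: cudaError_t cudaSetDeviceFlags(unsigned int flags)
    cudart.cudaSetDeviceFlags.argtypes = [ctypes.c_uint]
    cudart.cudaSetDeviceFlags.restype = ctypes.c_int
    return cudart.cudaSetDeviceFlags(CUDA_DEVICE_SCHEDULE_BLOCKING_SYNC)


if __name__ == "__main__":
    status = request_blocking_sync()
    if status is None:
        print("No CUDA runtime library found; nothing was changed.")
    else:
        print(f"cudaSetDeviceFlags returned error code {status}")
```

If this is tried, it would presumably need to run before any torch.cuda call, and the effect should be checked by profiling the CPU thread again rather than taken for granted.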

Thank you for the response and the related issue. No worries, I will keep track of the issue for future updates. I am marking this as the solution to close the topic.