I am studying a PyTorch model running on a heterogeneous CPU/GPU system. Whenever there is dynamic control flow (i.e., an `if` statement whose predicate depends on the value of a tensor resident on the GPU), the CPU synchronizes with the GPU, waiting for that value to be computed. This happens when the GPU is the bottleneck, i.e., the CPU is running ahead of the GPU and the instruction pointer (on the CPU) reaches the predicate of the `if` statement before the kernel that produces the tensor the predicate depends on finishes on the GPU. Upon profiling, I can see that the CPU spins for the entire synchronization duration. CUDA supports having the CPU block, rather than spin, during synchronization via the cudaSetDeviceFlags API. Some issues[1] on the NVIDIA developer forum report that this spinning behaviour is present on Linux machines; since I use Linux, I can verify this to be true in my case. I have also asked a related query on the NVIDIA developer forums[link].
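For concreteness, here is a minimal sketch of the kind of data-dependent control flow I mean (the function name and values are made up for illustration); the conversion of a GPU tensor to a Python bool in the `if` predicate is what forces the implicit synchronization:

```python
import torch

def clamp_if_large(x: torch.Tensor) -> torch.Tensor:
    # `x.max() > 10` is a zero-dim tensor living on x's device. Converting it
    # to a Python bool for the `if` forces the CPU to wait until the kernels
    # that produced `x` have finished on the GPU (an implicit synchronization).
    if x.max() > 10:  # <-- data-dependent control flow: CPU syncs with GPU here
        return x.clamp(max=10)
    return x

# Falls back to CPU so the sketch also runs on machines without CUDA.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.arange(20, dtype=torch.float32, device=device)
y = clamp_if_large(x)
```

On a CUDA device, profiling this shows the CPU busy-waiting at the `if` line whenever the GPU has not yet finished computing `x`.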
Question
- Does the PyTorch API allow us to use the cudaSetDeviceFlags API? If yes, please point me to the steps, as I cannot find the relevant documentation. If not, are there any problems with the CPU blocking instead of spinning during synchronization?
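In case it helps frame the question: I have not found a documented PyTorch wrapper for this, but as a workaround sketch (assumptions: Linux, the runtime library loads as `libcudart.so`, and the call is made before the CUDA context is created in the process) the raw runtime call could be issued via ctypes:

```python
import ctypes

# Value of cudaDeviceScheduleBlockingSync from cuda_runtime_api.h.
cudaDeviceScheduleBlockingSync = 0x04

def set_blocking_sync() -> bool:
    """Ask the CUDA runtime to block (not spin) the CPU thread on sync.

    Must run before the CUDA context exists, e.g. before the first tensor is
    moved to the GPU; otherwise the runtime may reject or ignore the flag.
    Returns True on cudaSuccess, False if the library is missing or the call
    fails.
    """
    try:
        libcudart = ctypes.CDLL("libcudart.so")  # library name is an assumption
    except OSError:
        return False  # CUDA runtime not available on this machine
    err = libcudart.cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync)
    return err == 0  # 0 == cudaSuccess
```

Whether this plays well with PyTorch's own context initialization is part of what I am asking.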
Please feel free to point out anything I missed. Thanks.