I know that CUDA calls are non-blocking and run asynchronously: if I launch some operation on the data, then unless I perform an operation that requires a sync (`.item()`, `.to()`, etc.), my code can continue running other lines without waiting for the operation to finish.
I know (or at least I think I know) that the metadata for the tensor is saved on the CPU, so for example a call to .shape does not require a sync.
If that is the case, what happens when the shape of the output of some operation is not yet known? Say:

```python
# f is something that takes a lot of time
y = f(x)
y.shape  # first case: only metadata is needed

y = f(x)
if y == 0:  # second case: the branch depends on the data
    ...
```

In the second case, does the condition cause a sync?
Testing similar cases myself is something I usually love to do, but here I'm not sure how to construct a function `f(x)` containing a CUDA kernel that takes an arbitrarily long time to run. Bonus points if you can show me a way to do that.
Yes — since you added data-dependent control flow, the host needs to wait for the result before deciding which code path to take.
Spin a large enough matmul in a loop.
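As a sketch of such an `f(x)` (assuming a CUDA-capable setup; the matrix size and iteration count are arbitrary knobs you can tune):

```python
import torch

def f(x, iters=1000):
    # Repeated large matmuls keep the GPU busy for a long time.
    # The kernels are queued asynchronously, so the host returns
    # from this function almost immediately.
    y = x
    for _ in range(iters):
        y = y @ x
    return y

if torch.cuda.is_available():
    x = torch.randn(4096, 4096, device="cuda")
    y = f(x)        # returns right away; kernels still running on the GPU
    print(y.shape)  # metadata only: no sync
    print(y[0, 0])  # reading an actual value forces a sync
```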
Here is a recent example I posted in another issue showing how the CPU can run ahead or be blocked.
EDIT: you can also use the (experimental) `torch.cuda.set_sync_debug_mode` and set it to `"warn"` (or `"error"` to raise instead) to receive:

```python
# UserWarning: called a synchronizing CUDA operation (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:147.)
# if y[0, 0] == 0:
```
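A minimal way to use this debug mode might look as follows (assuming a CUDA device; `set_sync_debug_mode` is experimental and only observes CUDA synchronizations):

```python
import torch

if torch.cuda.is_available():
    torch.cuda.set_sync_debug_mode("warn")  # or "error" to raise an exception

    y = torch.randn(10, device="cuda")
    y.shape          # metadata lookup: no warning is emitted
    if y[0] == 0:    # data-dependent branch: emits the sync warning
        pass

    torch.cuda.set_sync_debug_mode("default")  # restore normal behavior
```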
Thank you @ptrblck, comprehensive and helpful as usual.
When I try to loop multiple times and then ask for the shape as you suggested, I see that no sync happens. That must mean that the metadata (such as the shape) is saved on the CPU and is available even before the data itself is calculated on the GPU.
I still have questions:
- So if I understand correctly, when a developer adds a new operation, they must write a separate piece of code, apart from the kernel that runs on the GPU, where they specify the expected output shape? And I've never seen such code because it is part of the C/C++ codebase, abstracted away from the Python codebase?
- Using your suggested method, I saw that `tensor.nonzero()` causes a CUDA sync. Is it a general rule of thumb, then, that when the output shape cannot be known beforehand, the call must be blocking?
Yes, your explanation is valid: usually the shape of the output tensor can be calculated from the input shape together with the setup of the operation (e.g. padding, stride, etc. for a convolution). However, the output shape of `nonzero` depends on the actual values and thus triggers a sync.
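To see why `nonzero`'s output shape is value-dependent, here is a small CPU-side illustration (the sync itself only occurs for CUDA tensors, but the shape behavior is the same):

```python
import torch

# Two tensors with the same shape but different values:
a = torch.tensor([0, 1, 0, 2])
b = torch.tensor([1, 1, 1, 1])

# nonzero returns one row of indices per nonzero element, so the
# output shape depends on the data, not just on the input shape.
print(a.nonzero().shape)  # torch.Size([2, 1])
print(b.nonzero().shape)  # torch.Size([4, 1])
```

For a CUDA tensor, the host cannot know how many rows to allocate metadata for until the kernel has actually counted the nonzero elements, hence the sync.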