I have a question about how Python executes statements involving tensors on the GPU. I assumed that when Python reaches such a statement, its operands would have to be in main memory, yet most of the tensors involved live on the GPU. To my surprise, I did not see any device-to-host transfer when such statements execute, and I'm curious how PyTorch achieves this.
Specifically, for the following two statements:
context_layer = torch.nn.functional.scaled_dot_product_attention(query_layer, key_layer, value_layer, is_causal=True)
context_layer = context_layer.permute(2, 0, 1, 3)
I thought that after the first statement executes, context_layer would need to be transferred back to the CPU so that Python could execute the second statement. However, I did not observe any such data transfer, even though I set breakpoints on common CUDA functions such as cudaMemcpyAsync, cudaMemcpy, and so on.
My guess is that this is achieved through some reference-like mechanism. I ran into this while debugging PyTorch with GDB and found inconsistent behavior: when a breakpoint is set on the first statement, there is a data transfer after it executes, but when the two statements run without breakpoints, there is no transfer at all. So I suspect that this particular transfer is triggered by the debugger itself.
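In case it is relevant, this is my assumption about what the debugger is doing; I have not confirmed it in the PyTorch source. Anything that needs the actual values on the host, such as printing the tensor, forces a synchronization and a device-to-host copy, and a breakpoint that displays the variable would do exactly that:

import torch

x = torch.randn(4, 4, device="cuda")

# Launching kernels and rebinding the Python name never touches host memory;
# only the CPU-side handle (shape, strides, device pointer) changes.
y = (x @ x).permute(1, 0)

# Reading values on the host forces a sync and a device-to-host copy.
# I assume a debugger displaying y at a breakpoint goes through the same path.
print(y)               # copies the data back in order to print it
val = y[0, 0].item()   # likewise copies a single element to the host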