How Does PyTorch Handle GPU Tensors in Python Statements, Particularly Regarding Host-Device Transfers and Debugger Effects?

I have a question about how Python executes statements involving tensors on the GPU. I assumed that when Python reaches such a statement, all of its operands would need to be in main memory, yet most of these tensors live on the GPU. To my surprise, I noticed that there is no device-to-host transfer when such a statement executes, and I'm curious how PyTorch achieves this.

Specifically, for the following two statements:

context_layer = torch.nn.functional.scaled_dot_product_attention(query_layer, key_layer, value_layer, is_causal=True)
context_layer = context_layer.permute(2, 0, 1, 3)

I thought that after the first statement executes, context_layer would need to be transferred back to the CPU so that Python could execute the second statement. However, I did not observe any such data transfer, even though I set breakpoints on common CUDA functions such as cudaMemcpyAsync and cudaMemcpy.

I believe this might be achieved through some reference-like mechanism. I ran into this while debugging PyTorch with GDB and found inconsistent behavior: when a breakpoint is set on the first statement, there is a data transfer after it executes, but when the two statements run without a breakpoint, there is none. I therefore suspect that this data transfer is caused by the debugger.
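
As a minimal sketch of the kind of inspection that does force a transfer (assuming a CUDA device is available; this is speculation about my session, not a confirmed diagnosis): merely rendering a CUDA tensor's values, e.g. via print() or .item(), copies them to the host, while simply holding the tensor handle does not. If the debugger formats tensor values when stopped at a breakpoint, it might trigger exactly this kind of copy:

import torch

if torch.cuda.is_available():
    x = torch.randn(4, 4, device="cuda")
    y = x @ x              # launches a CUDA kernel; no host copy involved
    # Holding the handle y moves nothing to the CPU.
    print(y)               # formatting the values forces a device-to-host copy
    v = y[0, 0].item()     # .item() also synchronizes and copies one element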

Moving data to the CPU after each call would synchronize the code and result in terrible performance. Once the data has been moved to the GPU, PyTorch executes CUDA kernels on it by passing its device pointer to the kernel, which is the standard approach. There is no reason to move the data back.
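
As a minimal sketch (assuming a CUDA device is available and PyTorch 2.0+ for scaled_dot_product_attention; the shapes are arbitrary), chaining GPU operations never copies the values to the host; only an explicit call such as .cpu() does:

import torch
import torch.nn.functional as F

if torch.cuda.is_available():
    q = torch.randn(8, 16, 128, 64, device="cuda")
    k = torch.randn_like(q)
    v = torch.randn_like(q)
    # Both calls are queued on the CUDA stream; Python only passes tensor
    # handles (metadata plus device pointers) around. The values stay on the GPU.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    out = out.permute(2, 0, 1, 3)
    print(out.device)    # cuda:0 -- still no device-to-host copy
    host = out.cpu()     # only this explicit call issues the transfer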

Thanks for the reply. I now understand the performance issues this would cause and that the data transfer is avoided by working with pointers. However, I am still puzzled: after the first statement executes, context_layer is a tensor on the GPU, and its data has not been transferred back to the CPU. Does this imply that, at that moment, the CPU and GPU portions of context_layer's data are inconsistent, even though this won't affect computations (because pointers are used)?
Could you also point me to some documentation on this topic? Thanks.

There are no data portions on the CPU. The tensor object holds its metadata, such as the shape and strides, on the CPU, along with a data pointer to the GPU memory where the actual data is stored.
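
A minimal sketch illustrating this (assuming a CUDA device is available): permute only rewrites the host-side metadata, and the returned view points at the same GPU allocation, as data_ptr() shows:

import torch

if torch.cuda.is_available():
    t = torch.randn(8, 16, 128, 64, device="cuda")
    p = t.permute(2, 0, 1, 3)
    print(t.shape, t.stride())            # metadata stored on the host side
    print(p.shape, p.stride())            # permute only rewrote shape/stride
    print(t.data_ptr() == p.data_ptr())   # True: same GPU memory, no copy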


Thanks. I believe I need to read the Tensor source code.