When developing custom CUDA kernels, it is common to get a raw pointer to a tensor's GPU memory and pass it to the kernel, as shown below.
kernel<<<num_block, num_threads>>>(tensor.contiguous().data_ptr<float>());
where tensor is a tensor passed in from the Python side. The problem is that CUDA kernels execute asynchronously. So, how do we know that the pointer passed to the kernel is still valid when the kernel actually runs?
More specifically, the launch call returns immediately, possibly before the kernel has even started executing. So, how does PyTorch know that the tensor is still needed and shouldn't be freed in the meantime?
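To make the timing concrete, here is a minimal sketch of the pattern above as a full launch function. The kernel, function names, and launch sizes are hypothetical illustrations, assuming the ATen C++ extension API; the comments mark exactly where the lifetime question arises.

```cuda
#include <torch/extension.h>
#include <cuda_runtime.h>

// Hypothetical kernel: doubles each element in place.
__global__ void scale_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void scale(torch::Tensor tensor) {
    // contiguous() may return a brand-new temporary tensor (a copy),
    // of which this local variable is the only owner.
    auto contig = tensor.contiguous();
    int n = contig.numel();
    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    scale_kernel<<<blocks, threads>>>(contig.data_ptr<float>(), n);
    // Control returns here immediately; the kernel may not have run yet.
    // 'contig' is destroyed when this function returns, so whether the
    // raw pointer the kernel holds is still valid at execution time is
    // exactly the question posed above.
}
```

Note that writing tensor.contiguous().data_ptr<float>() inline, as in the one-liner above, is even more delicate: the temporary produced by contiguous() is destroyed at the end of that full expression, before the kernel can possibly have run.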