Best practice for using data_ptr<Type>

When developing custom CUDA kernels, it is common to get a raw pointer to a tensor's GPU memory and pass it to the kernel, as shown below.

kernel<<<num_blocks, num_threads>>>(tensor.contiguous().data_ptr<float>());

Here, tensor is a tensor passed in from the Python side. The problem is that CUDA kernel launches are asynchronous: the launch call returns immediately, and the kernel may not execute until later. So how do we know that the pointer passed to the kernel is still valid when the kernel actually runs? More specifically, how does PyTorch know that the tensor is still needed and shouldn't be freed in the meantime?
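For concreteness, here is a minimal sketch of the kind of extension code the question is about (the kernel my_kernel and the wrapper my_op are hypothetical names, not part of any PyTorch API):

#include <torch/extension.h>
#include <ATen/cuda/CUDAContext.h>

// Hypothetical kernel: writes 1.0f into every element.
__global__ void my_kernel(float* data, int64_t n) {
  int64_t i = (int64_t)blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] = 1.0f;
}

// Wrapper called from Python. The launch only enqueues work and returns
// immediately; the pointer must remain valid until the kernel executes.
void my_op(at::Tensor tensor) {
  at::Tensor t = tensor.contiguous();  // may create a temporary copy
  const int64_t n = t.numel();
  const int threads = 256;
  const int blocks = (int)((n + threads - 1) / threads);
  cudaStream_t stream = at::cuda::getCurrentCUDAStream();
  my_kernel<<<blocks, threads, 0, stream>>>(t.data_ptr<float>(), n);
}

Even here, t (and the temporary copy it may own) is destroyed as soon as my_op returns, typically well before the kernel has actually run, which is exactly the concern raised above.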

The CUDACachingAllocator checks whether the block requested to be freed (or rather, returned to the cache) is still in use on other streams, and if so it records an event on those streams, as seen here. Because freeing only returns the block to the cache rather than calling cudaFree, and the block is handed out again only in a way that respects stream ordering (and any recorded events), the pending kernel is guaranteed to be done with the memory before it can be reused.
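As a related sketch (not code from this thread): when a tensor's memory is used on a stream other than the one it was allocated on, the allocator has to be told about that use via recordStream, which is what triggers the event recording mentioned above. Something along these lines (fill_one and fill_on_side_stream are hypothetical names):

#include <torch/extension.h>
#include <c10/cuda/CUDAStream.h>
#include <c10/cuda/CUDACachingAllocator.h>

__global__ void fill_one(float* data, int64_t n) {
  int64_t i = (int64_t)blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] = 1.0f;
}

// Consume the tensor on a side stream instead of the current stream.
void fill_on_side_stream(at::Tensor tensor) {
  at::Tensor t = tensor.contiguous();
  const int64_t n = t.numel();
  const int threads = 256;
  const int blocks = (int)((n + threads - 1) / threads);

  c10::cuda::CUDAStream side = c10::cuda::getStreamFromPool();
  fill_one<<<blocks, threads, 0, side.stream()>>>(t.data_ptr<float>(), n);

  // Tell the caching allocator that this block is in use on `side`.
  // When the block is later freed, the allocator records an event on `side`
  // and will not reuse the memory until that event has completed.
  c10::cuda::CUDACachingAllocator::recordStream(t.storage().data_ptr(), side);
}

The Python-side equivalent is tensor.record_stream(stream).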
