Question about Tensor storage lifespan

I have some confusion about the lifecycle of PyTorch tensors. The lifetime of a PyTorch tensor is managed by Python, while CUDA computations in PyTorch run asynchronously with respect to the Python interpreter (CPU). Could a situation arise where Python triggers the destruction of a tensor while a CUDA kernel that uses it is still executing? Consider the following example:

a = torch.randn(3, 4).cuda() + torch.randn(1).cuda()

PyTorch launches the CUDA kernel for the addition asynchronously. However, the tensor created by torch.randn(3, 4).cuda() is a temporary: its last Python reference is dropped as soon as __add__ returns and the result is bound to a. At that point the kernel might still be running, so reclaiming the temporary's storage could, in principle, let the kernel access freed device memory.
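
To make the concern concrete, here is the same computation with the temporaries given explicit names so the reference drop is visible (a minimal sketch, assuming a CUDA device is available):

import torch

tmp = torch.randn(3, 4).cuda()   # the left-hand temporary from the one-liner
b = torch.randn(1).cuda()
a = tmp + b                      # __add__ launches the addition kernel asynchronously
del tmp, b                       # Python-side references are gone; the kernel may still be in flight
torch.cuda.synchronize()
print(a)                         # the result nevertheless comes out correct

In the original one-liner the del is implicit: the temporaries' reference counts hit zero right after the expression is evaluated.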

However, in my tests this situation never seems to occur. I would like to understand how PyTorch guarantees that an asynchronous CUDA kernel has finished before the memory backing such temporary tensors is actually reclaimed.
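
For context, a stress test along these lines (the sizes and iteration count are arbitrary) is representative of what I tried: build the result purely from GPU temporaries, never synchronize in between, and compare against a CPU reference. I never see wrong values or an illegal memory access.

import torch

for _ in range(1000):
    lhs = torch.randn(1024, 1024)
    rhs = torch.randn(1)
    expected = lhs + rhs                           # CPU reference
    result = lhs.cuda() + rhs.cuda()               # both GPU operands are temporaries
    assert torch.allclose(result.cpu(), expected)  # .cpu() synchronizes before comparing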