Tensor state in CUDA kernel

I am using a custom kernel where a new tensor is created in the host calling method and passed to the kernel. I am not using any device synchronization, so the kernel is launched asynchronously and control returns to Python. If I then access the elements of the created tensor, the kernel seems to be synchronized and the computed values are returned. This is expected behavior in PyTorch, but how does PyTorch do it? Is there some global state in which every tensor's pointer location is recorded and checked, so that element access waits until the tensor is no longer in use by a CUDA kernel? Also, if I pass a tensor to the CUDA kernel as a constant reference that is only read but never written, will accessing its elements also synchronize the kernel?
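For concreteness, a minimal sketch of the setup I mean (`my_ext` and `launch_fill` are placeholders for my actual compiled extension, not real modules):

```python
import torch
import my_ext  # placeholder for my compiled CUDA extension

out = torch.empty(1024, device="cuda")  # tensor created in the host calling method
my_ext.launch_fill(out)                 # launches the custom kernel, no explicit sync

# Control is back in Python here and the kernel may still be running,
# yet this already returns the computed value:
print(out[0])
```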

PyTorch synchronizes for you when an operation needs the data on the CPU. If I’m not mistaken, a Python print of a CUDA tensor fits this definition (as does `.item()`), since the values first have to be copied to the host, and that copy waits for the work already queued on the same stream. Besides that, you should still be careful with your custom CUDA code, as you can create race conditions by missing necessary syncs.
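A quick way to see this implicit synchronization (a minimal sketch using a stock matmul in place of a custom kernel; the exact timings are illustrative only):

```python
import time
import torch

a = torch.randn(4096, 4096, device="cuda")

torch.cuda.synchronize()     # start from an idle device
t0 = time.perf_counter()
b = a @ a                    # kernel is only enqueued; the call returns almost immediately
t1 = time.perf_counter()
val = b[0, 0].item()         # device-to-host copy on the same stream -> waits for the matmul
t2 = time.perf_counter()

print(f"launch:       {t1 - t0:.6f} s")  # tiny: the launch is asynchronous
print(f"read element: {t2 - t1:.6f} s")  # includes the matmul's runtime
```

The first timing stays small because the matmul is only queued; the second includes the kernel's execution time, because reading an element forces the copy to the host to wait for everything already enqueued on that stream.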