PyTorch Deallocates Data in C++ Extension While Using It?

I’m using CUDAExtension from torch.utils.cpp_extension to build a CUDA extension. In the C++ CPU-side function called from Python (bound with pybind11), I allocate and zero-initialize a device-side counter, let kernels increment it, copy its value back to the host, and then free it.

Sometimes the counter becomes garbage while it’s being used, and prints as garbage afterwards. Other times it executes correctly on the exact same data.

It seems more likely to become garbage when more GPU memory is in use.

Can PyTorch garbage-collect GPU data while a pointer referencing it is still in-scope?

If so, how do I tell PyTorch not to deallocate that data? How can PyTorch call any external library (like cuDNN) without deallocating its GPU data?

This is not a minimal case, but a conceptual example:

std::vector<torch::Tensor> somefunction(some inputs){
    // Allocate and zero a single device-side counter
    int* counter;
    cudaMalloc( (void **)&counter, sizeof(int) );
    cudaMemset( counter, 0, sizeof(int) );

    // Use counter in kernels, where it is incremented

    // Copy the final value back to the host
    int h_counter[1];
    cudaMemcpy( h_counter, counter, sizeof(int), cudaMemcpyDeviceToHost );

    printf("counter is now garbage: %i\n", *h_counter);

    cudaFree( counter );
    return {some tensor};
}


PyTorch won’t touch any data that you allocated yourself.
Are you sure that you use the counter correctly in your kernels, and that you have the proper synchronization in place before reading its value?
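To illustrate the synchronization point: PyTorch launches its kernels on a per-device “current” CUDA stream rather than the default stream, so a plain cudaMemcpy can read the counter before kernels enqueued on that stream have finished. A minimal sketch of reading the counter in stream order, assuming your kernels are launched on PyTorch’s current stream (the helper name read_counter is made up for illustration):

#include <ATen/cuda/CUDAContext.h>

int read_counter(const int* d_counter) {
    // The stream PyTorch is currently using on this device
    cudaStream_t stream = at::cuda::getCurrentCUDAStream();

    int h_counter = 0;
    // Enqueue the copy on the same stream so it is ordered after
    // the kernels, then block the host until it has completed
    cudaMemcpyAsync( &h_counter, d_counter, sizeof(int),
                     cudaMemcpyDeviceToHost, stream );
    cudaStreamSynchronize( stream );
    return h_counter;
}

Conversely, if your kernels go to the default stream while PyTorch’s work runs on another, you get the same race in the other direction; launching everything on getCurrentCUDAStream() avoids both cases.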