Copy_() memory leak?

I have the following simple function:

void experiment(const char* filename, bool use_copy) {
  torch::jit::script::Module module(torch::jit::load(filename));
  module.to(torch::kCUDA);
  torch::Tensor input = torch::zeros({8, 2, 7, 6});
  torch::Tensor output = torch::zeros({8, 7});

  std::vector<torch::jit::IValue> input_vec;

  for (int i = 0; i < 10000; ++i) {
    torch::Tensor gpu_input = input.clone().to(torch::kCUDA);
    input_vec.push_back(gpu_input);
    auto gpu_output = module.forward(input_vec).toTuple()->elements()[0].toTensor();
    if (use_copy) {
      output.copy_(gpu_output);
    } else {
      output = gpu_output;
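      // note: .to() is not in-place; the tensor returned by the next line is discarded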
      output.to(torch::kCPU);
    }
    input_vec.clear();
    if (i % 100 == 0) {
      dump_cuda_memory_info();
    }
  }
}
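
(dump_cuda_memory_info is omitted above for brevity; a minimal sketch of such a helper, assuming it only needs to wrap cudaMemGetInfo, would be:)

#include <cstdio>
#include <cuda_runtime_api.h>

void dump_cuda_memory_info() {
  size_t free_bytes = 0, total_bytes = 0;
  // Query free/total memory of the current CUDA device.
  cudaMemGetInfo(&free_bytes, &total_bytes);
  std::printf("GPU 0 memory: free=%zu, total=%zu\n", free_bytes, total_bytes);
}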

When I run this with use_copy=true, the GPU's free memory decreases rapidly, indicating a memory leak:

GPU 0 memory: free=21759655936, total=25438715904
GPU 0 memory: free=21078081536, total=25438715904
GPU 0 memory: free=20396507136, total=25438715904
GPU 0 memory: free=19714932736, total=25438715904
GPU 0 memory: free=19033358336, total=25438715904
GPU 0 memory: free=18351783936, total=25438715904
GPU 0 memory: free=17670209536, total=25438715904
GPU 0 memory: free=16988635136, total=25438715904
GPU 0 memory: free=16307060736, total=25438715904
GPU 0 memory: free=15625486336, total=25438715904
...

When I pass use_copy=false, the free memory stays fixed, indicating no memory leak.

In my application, however, the use_copy=false approach is not viable, because I need the CPU output tensor's data pointer to stay fixed.

What is the right way to copy tensors from GPU to a fixed CPU memory address without leaking memory?

Elsewhere on these forums, I found the suggestion to release cached memory by calling emptyCache(). That no longer seems viable, as simply adding the line:

#include <c10/cuda/CUDACachingAllocator.h>

leads to a compile error with build version 1.13.0+cu116.

Could you check whether detaching the tensor before calling copy_ helps, if you are not using the computation graph or calling backward (similar to what is described in `copy_` operations get repeated in autograd computation graph)?
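
Something along these lines, reusing module, input_vec, and output from your snippet (just a sketch, wrapped in a standalone helper for illustration):

#include <torch/script.h>
#include <vector>

void run_step(torch::jit::script::Module& module,
              std::vector<torch::jit::IValue>& input_vec,
              torch::Tensor& output) {
  auto gpu_output =
      module.forward(input_vec).toTuple()->elements()[0].toTensor();
  // Detach before the in-place copy so autograd does not keep recording
  // copy_() nodes; output keeps its original CPU data pointer.
  output.copy_(gpu_output.detach());
}
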
Alternatively, if you know that gradients are not needed anywhere, you could try using the no_grad guard as well:
Typedef torch::NoGradGuard — PyTorch master documentation
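
If you are sure gradients are not needed anywhere, the same step wrapped in the guard would look roughly like this:

#include <torch/script.h>
#include <torch/torch.h>
#include <vector>

void run_step_no_grad(torch::jit::script::Module& module,
                      std::vector<torch::jit::IValue>& input_vec,
                      torch::Tensor& output) {
  torch::NoGradGuard no_grad;  // disables autograd for this scope
  auto gpu_output =
      module.forward(input_vec).toTuple()->elements()[0].toTensor();
  output.copy_(gpu_output);  // no graph is recorded, so nothing accumulates
}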


Thanks, calling .detach() did the trick!
