I have the following simple function:
void experiment(const char* filename, bool use_copy) {
    torch::jit::script::Module module(torch::jit::load(filename));
    module.to(torch::kCUDA);

    torch::Tensor input = torch::zeros({8, 2, 7, 6});
    torch::Tensor output = torch::zeros({8, 7});
    std::vector<torch::jit::IValue> input_vec;

    for (int i = 0; i < 10000; ++i) {
        torch::Tensor gpu_input = input.clone().to(torch::kCUDA);
        input_vec.push_back(gpu_input);
        auto gpu_output = module.forward(input_vec).toTuple()->elements()[0].toTensor();
        if (use_copy) {
            output.copy_(gpu_output);
        } else {
            output = gpu_output;
            output.to(torch::kCPU);
        }
        input_vec.clear();
        if (i % 100 == 0) {
            dump_cuda_memory_info();
        }
    }
}
When I call it with use_copy=true, I observe the free memory of the GPU rapidly decreasing, indicating a memory leak:
GPU 0 memory: free=21759655936, total=25438715904
GPU 0 memory: free=21078081536, total=25438715904
GPU 0 memory: free=20396507136, total=25438715904
GPU 0 memory: free=19714932736, total=25438715904
GPU 0 memory: free=19033358336, total=25438715904
GPU 0 memory: free=18351783936, total=25438715904
GPU 0 memory: free=17670209536, total=25438715904
GPU 0 memory: free=16988635136, total=25438715904
GPU 0 memory: free=16307060736, total=25438715904
GPU 0 memory: free=15625486336, total=25438715904
...
When I pass use_copy=false, the free memory stays fixed, indicating no memory leak.
In my application, however, the use_copy=false approach is not viable, because I need the CPU output tensor's data pointer to remain unchanged across iterations.
What is the right way to copy tensors from GPU to a fixed CPU memory address without leaking memory?
Elsewhere on these forums, I found the suggestion to release cached memory by calling emptyCache(). This no longer seems viable: simply adding the line
#include <c10/cuda/CUDACachingAllocator.h>
leads to a compile error with build version 1.13.0+cu116.
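For reference, the call I was trying to reach is, I believe, the following (a sketch, assuming the symbol is still named c10::cuda::CUDACachingAllocator::emptyCache() in recent releases; release_cached_gpu_memory is my own wrapper name):

```cpp
#include <c10/cuda/CUDACachingAllocator.h>

// Assumption: this is the call the older forum posts refer to. It returns
// cached-but-unused blocks from PyTorch's caching allocator back to the
// CUDA driver; it does not free tensors that are still referenced.
void release_cached_gpu_memory() {
    c10::cuda::CUDACachingAllocator::emptyCache();
}
```

Even if I get this to compile, my understanding is that it would only mask the symptom: memory still referenced (e.g. by an autograd graph) would not be returned.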