Memory leak in LibTorch, extremely simple code

Ben_Harper · February 25, 2019, 6:20am

I have the following tiny code snippet, which allocates a new chunk of CUDA memory every time I call model.forward():

auto model = torch::jit::load("model.tm");
auto out   = torch::empty({1, 512, 512});
for (int i = 0; i < 20; i++) {
    auto in  = torch::empty({1, 3, 512, 512}, torch::kCUDA);
    // every call allocates another 2 GB of CUDA memory
    auto res = model->forward({in}).toTensor();
    // without this line, there is no leak
    out.copy_(res[0]);
}

Obviously, this runs out of CUDA memory very quickly.
Curiously, if I don’t consume the result of model.forward(), there is no leak.
Am I doing something wrong?

I’m on Ubuntu 18.04, CUDA 10.0, and PyTorch compiled from source, v1.0.1 branch.

Ben_Harper · February 25, 2019, 11:50am

I have found a workaround (pseudo-code):

float* buf = (float*) malloc(bufSize);
for (int i = 0; i < 20; i++) {
    // input is a CUDA tensor; forward() returns an IValue, so convert it to a Tensor
    auto res = model->forward({input}).toTensor();
    auto resCPU = res.cpu();
    memcpy(buf, resCPU.accessor<float, 3>().data(), bufSize);
    // resCPU goes out of scope here, so presumably some reference count
    // is also released, and the CUDA memory involved in model.forward()
    // can then be freed.
}

The key difference here is that there is no long-lived object that is the destination of a CUDA -> CPU copy. In my original example, the ‘out’ object seems to cause references to the CUDA memory to never be released.
Is this intended behaviour?

albanD · February 26, 2019, 11:05am

Hi,

The problem is that you’re running a forward pass through a network. If you don’t disable autograd, all the intermediate buffers are saved so that you can backprop later.
You then save that output in another Tensor, which keeps this whole history around.

If you just want to do evaluation, you should disable GradMode (not sure where it is in the cpp doc).
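
The effect described above can be sketched with a toy model in plain C++. This is not LibTorch code and does not reflect LibTorch's real data structures: Node, Value, forward, copy_into and live_nodes_after_loop are hypothetical names, and the rule that an in-place copy links the destination to its previous history is an assumption made for illustration. The only point is that a long-lived destination holding a pointer to recorded history keeps every saved buffer alive.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Toy stand-in for a graph node: owns a "saved buffer" and keeps its parents alive.
struct Node {
    static int live;                             // number of nodes currently alive
    std::vector<float> saved;                    // pretend activation buffer
    std::vector<std::shared_ptr<Node>> parents;  // history this node depends on
    explicit Node(std::size_t n) : saved(n) { ++live; }
    ~Node() { --live; }
};
int Node::live = 0;

// Toy tensor: a value plus an optional pointer to the history that produced it.
struct Value {
    float data = 0.0f;
    std::shared_ptr<Node> history;               // null means "no history"
};

// Pretend forward pass: records one node only when record_history is true.
Value forward(float x, bool record_history) {
    Value out;
    out.data = 2.0f * x;
    if (record_history) out.history = std::make_shared<Node>(1024);
    return out;
}

// Pretend in-place copy into a long-lived destination. If the source has
// history, the destination's new history links to the source's history and
// to the destination's own previous history (an assumption of this toy).
void copy_into(Value& dst, const Value& src) {
    dst.data = src.data;
    if (src.history) {
        auto node = std::make_shared<Node>(0);
        node->parents.push_back(src.history);
        if (dst.history) node->parents.push_back(dst.history);
        dst.history = node;
    }
}

// Runs the loop and returns how many nodes are alive while `out` still exists.
int live_nodes_after_loop(int iterations, bool record_history) {
    assert(iterations >= 0);
    Value out;                                   // long-lived, like `out` in the question
    for (int i = 0; i < iterations; ++i) {
        Value res = forward(static_cast<float>(i), record_history);
        copy_into(out, res);
    }                                            // `res` is destroyed every iteration
    return Node::live;                           // read before `out` is destroyed
}
```

In this toy, live_nodes_after_loop(20, true) returns 40 (two nodes per iteration stay reachable from out), while live_nodes_after_loop(20, false) returns 0, because nothing is recorded and each res frees its memory when it goes out of scope.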

shane-carroll · February 27, 2019, 2:53am

The C++ equivalent of torch.no_grad() is NoGradGuard, from torch/csrc/api/include/torch/utils.h. The comments in that header describe it as a thread-local guard that disables gradients.

torch::NoGradGuard no_grad_guard;
// rest of your code, run with gradients disabled
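
What "thread-local guard" means in practice can be sketched in plain C++. This is a toy illustration of the RAII pattern, not LibTorch's implementation: toy_grad_enabled, ToyNoGradGuard, flag_inside_guard and self_check are hypothetical names. The guard switches a thread-local flag off when it is constructed and restores the previous value when it goes out of scope, so it only covers the scope and the thread in which it is declared.

```cpp
#include <cassert>

// Thread-local flag standing in for "is gradient recording enabled?".
// Each thread gets its own copy, initialised to true.
thread_local bool toy_grad_enabled = true;

// RAII guard: disables the flag on construction and restores the
// previous value on destruction.
class ToyNoGradGuard {
public:
    ToyNoGradGuard() : previous_(toy_grad_enabled) { toy_grad_enabled = false; }
    ~ToyNoGradGuard() { toy_grad_enabled = previous_; }
    ToyNoGradGuard(const ToyNoGradGuard&) = delete;
    ToyNoGradGuard& operator=(const ToyNoGradGuard&) = delete;
private:
    bool previous_;  // value of the flag before the guard was created
};

// Returns the flag's value as seen inside a guarded scope.
bool flag_inside_guard() {
    ToyNoGradGuard guard;
    return toy_grad_enabled;  // evaluated before `guard` is destroyed
}

// Small self-check: the flag is off inside the scope and back on afterwards.
void self_check() {
    assert(toy_grad_enabled);
    assert(!flag_inside_guard());
    assert(toy_grad_enabled);
}
```

The practical consequence is that the guard has to be alive, on the calling thread, for the whole time the forward call runs. A guard declared in a scope that has already ended has no effect on later calls.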

Ben_Harper · February 27, 2019, 2:58am

Thanks @albanD, @shane-carroll!

torch::NoGradGuard no_grad_guard works!

Memory usage also drops by more than half, which makes sense.

This really ought to be in the “getting started” docs. I’ll try and find time to make a little PR that includes this in the documentation.

Z_Froyo · June 28, 2019, 8:38am

I have the same problem on CPU. I tried [torch::NoGradGuard no_grad_guard;], but it doesn’t help. I compile the code into a DLL for another program to use. There’s no problem when I run my code on its own. In the other program, it’s still OK when the call to [forward] is commented out… I really don’t know how to solve this. Does anyone know? Please help, thanks!

Kai_Huang · September 18, 2019, 7:33am

I ran into this problem too. Using torch::NoGradGuard no_grad_guard works.
What if I also want to call backward? In that case we can’t use NoGradGuard…
Thanks in advance for the help!

albanD · September 18, 2019, 2:05pm

If you want to call backward, then you need to keep these buffers. Be careful, though, to keep in the graph only the things that make sense for your problem.
If you want to accumulate the loss, for example, you will need to use detach() to make sure the history isn’t kept.

Kai_Huang · September 20, 2019, 4:48am

Thanks so much for your reply!

So after accumulating loss or getting gradients, what variables need to be detached?

albanD · September 20, 2019, 1:44pm

The variables that naturally go out of scope don’t need any special treatment.
If you keep some things around across iterations, for example if you accumulate the loss for every batch of the epoch, then you want to detach before accumulating in this buffer.
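
The difference between accumulating an attached loss and a detached one can be sketched with another toy model in plain C++. Again, this is not LibTorch code: Graph, Scalar, add, batch_loss and live_graph_nodes are hypothetical names, and Scalar::detach here only mimics the idea of returning the same value without its history.

```cpp
#include <cassert>
#include <memory>

// Toy history node: keeps the histories of both operands alive.
struct Graph {
    static int live;                 // number of history nodes currently alive
    std::shared_ptr<Graph> lhs, rhs;
    Graph() { ++live; }
    ~Graph() { --live; }
};
int Graph::live = 0;

// Toy scalar: a value plus an optional pointer to its history.
struct Scalar {
    float value = 0.0f;
    std::shared_ptr<Graph> graph;    // null means "plain value, no history"
    // Same value, history dropped.
    Scalar detach() const { Scalar s; s.value = value; return s; }
};

// Addition records a new history node if either operand carries history.
Scalar add(const Scalar& a, const Scalar& b) {
    Scalar out;
    out.value = a.value + b.value;
    if (a.graph || b.graph) {
        out.graph = std::make_shared<Graph>();
        out.graph->lhs = a.graph;
        out.graph->rhs = b.graph;
    }
    return out;
}

// Pretend per-batch loss that always carries one history node.
Scalar batch_loss(float v) {
    Scalar s;
    s.value = v;
    s.graph = std::make_shared<Graph>();
    return s;
}

// Accumulates `batches` losses into a long-lived total and returns how many
// history nodes are alive while the total still exists.
int live_graph_nodes(int batches, bool detach_first) {
    assert(batches >= 0);
    Scalar total;                    // lives across iterations
    for (int i = 0; i < batches; ++i) {
        Scalar loss = batch_loss(1.0f);
        total = add(total, detach_first ? loss.detach() : loss);
    }                                // `loss` goes out of scope every iteration
    return Graph::live;
}
```

In this toy, live_graph_nodes(10, false) returns 20, because the running total chains every batch's history together, while live_graph_nodes(10, true) returns 0, because each batch's history is freed as soon as loss goes out of scope.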

Kai_Huang · September 22, 2019, 10:09am

Oh, I see. What if I just read the loss, the input gradients, and the weights? Do I need to do anything special?
Thanks!

albanD · September 22, 2019, 4:13pm

That depends on what you do with them.
If you return them to other code as plain values, then you should detach them before returning them.
If you just use them locally to update the weights, nothing is needed.