Strange buffer-freeing behavior

I tried to run a WGAN model and ran into some strange behavior. I'm not sure whether it's my bug or a PyTorch bug. I made a minimal example to discuss it here.

So, we get the following error:

RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

If you look into the gist, you will find my comments on how to change the code so that it runs successfully. For example, I don't understand how inplace ReLU operations influence this error (see http://i.imgur.com/6H26H6o.png).

Main questions: can you explain why it works this way? Why do these errors occur?

Details: PyTorch 0.4.1, Ubuntu 16.04, GeForce 1080 Ti.

Note 1. The module BrokenBlock has no parameters. You can check this with the following code:

b = BrokenBlock(3)
for p in b.parameters():
    print(p.data.size())

And if it has no parameters, then backward should not need to walk through it. But setting ReLU(inplace=False) in BrokenBlock somehow makes the backward pass complete successfully.

Hi,

The thing is that during the .backward() call, to reduce peak memory usage, PyTorch frees all the memory buffers kept by the functions in the graph.
This means that you cannot call .backward() twice on the same graph, otherwise you will get the error you see above. Note that for graphs that do not keep any memory buffers, this works fine, since no missing buffers are encountered during the backward pass.
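A minimal sketch of both situations (toy tensors, just for illustration):

import torch

x = torch.ones(3, requires_grad=True)

# sum() does not need to save any tensor for its backward,
# so backwarding this graph twice never touches a freed buffer
y = x.sum()
y.backward()
y.backward()  # fine

# a multiplication saves its inputs; they are freed by the first
# backward, so the second call raises the error quoted above
z = (x * x).sum()
z.backward()
z.backward()  # RuntimeError: Trying to backward through the graph a second time ...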

I think each of the different scenarios you show actually changes which functions keep buffers and which don't.
What you need is gradient_penalty.backward(retain_graph=True) and loss_D_real.backward(retain_graph=True).

@albanD, you are right in general, and that much is obvious. But I think your conclusions are wrong.

Consider the following changes:

        # gradient_penalty.backward()  # comment this line and error disappears
        # loss_D_real.backward()
        # loss_D_fake.backward()
        loss_D = loss_D_fake - loss_D_real + gradient_penalty
        loss_D.backward()

The error still occurs. This means that some vertices (nodes) of the graph were visited twice, even though I don't call .backward() twice as you suggested. So I think the graph is backwarded twice by a single call. It is obviously caused by the autograd.grad routine, but why does the error disappear when the ReLU is made non-inplace?

One more thing. I haven't found a full description of the retain_graph option and the buffers it frees. What do those buffers contain? Why can or can't I free them? Could you link me to something on this topic?

For the buffer part: every op will store whatever it needs for its backward. For example, the op a * b needs to store both a and b to be able to compute its backward. These buffers are freed right after the backward call, to reduce memory usage by not keeping them around when they are no longer needed. If you plan to call backward twice, then you should tell the graph not to free them by passing retain_graph=True, so that the second call to backward still has the buffers.
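A small sketch of the a * b case (toy values, just for illustration):

import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)

c = a * b                      # the Mul node saves a and b for its backward
c.backward(retain_graph=True)  # keep the saved buffers around
c.backward()                   # the second call still finds them
print(a.grad, b.grad)          # gradients accumulate: tensor(6.) tensor(4.)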

In your case, I agree that this should not happen with a single call to backward().
Your code is quite big, so it may take a bit of time for us to check it. If you have a smaller version, it would be welcome :slight_smile:

I reduced it as much as I could, but I'll try to reduce it further. I think a visualization of the graph and of the buffers attached to its vertices would help to understand what is going on. Can I visualize it using TensorBoard? Any link would be very helpful.

In your case, I agree that this should not happen with a single call to backward().

Notice one more thing. Let's consider the following changes:

        # gradient_penalty.backward()  # comment this line and error disappears
        loss_D_real.backward()
        loss_D_fake.backward()

The error disappears, but we have run backward twice! The tensors loss_D_real and loss_D_fake are obtained from the same model netD. So, to understand the problem, we need to see what is going on during the backward pass. I need more documentation.

Here is a more minimal example. No more cuda/data parallel/nn.Module and such :smiley:
Still looking into why it fails.

import torch
from torch import nn, cuda
from torch.autograd import Variable, grad
from torch.nn import functional as F

# Debug stuff
import torchviz
torch.autograd.set_detect_anomaly(True)

inputs = torch.ones((1, 3, 256, 256), requires_grad=True)

tmp1 = F.instance_norm(inputs)
tmp2 = F.threshold(tmp1, 0., 0., True)  # threshold(0, 0, inplace=True) acts as an inplace ReLU
prob_interpolated = torch.sigmoid(tmp2)

gradients = grad(outputs=prob_interpolated, inputs=inputs,
                 grad_outputs=torch.ones(prob_interpolated.size()),
                 create_graph=True, retain_graph=True)[0]

gradient_penalty = gradients.sum()

# Debug graph
torchviz.make_dot(gradient_penalty).view()
gradient_penalty.backward()

OK!
After 2h of bug hunting, I found the problem, and it's a bug on our side.

On your side, you will have to use a workaround for now.
In your current code, replace the ReLU you're using with:

import torch
from torch import nn

class ReLU(nn.Module):
    def __init__(self, inplace=False):
        super(ReLU, self).__init__()
        self.inplace = inplace

    def forward(self, input):
        if self.inplace:
            return torch.relu_(input)
        else:
            return torch.relu(input)

And it should all be fine :slight_smile:
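For example, assuming your blocks currently use nn.ReLU(inplace=True), the drop-in replacement would look like this (hypothetical layer sizes, just to show the usage):

import torch
from torch import nn

block = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    ReLU(inplace=True),  # the workaround module defined above, instead of nn.ReLU(inplace=True)
)

x = torch.randn(1, 3, 16, 16, requires_grad=True)
block(x).sum().backward()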

If you see the same problem again, you can post here and we'll find a way around it.

I'll open an issue on GitHub for the bug and edit this post when it's done.

Thanks a lot for the bug report and the repro code.

I'll add more information here for future reference.

The corresponding PR to fix it is here.

Smallest repro code:

import torch
from torch import nn, cuda
from torch.autograd import Variable, grad
from torch.nn import functional as F

# Debug stuff
import torchviz
torch.autograd.set_detect_anomaly(True)

inputs = torch.ones((1, 3, 256, 256), requires_grad=True)

tmp1 = (inputs+1).view_as(inputs)       # a view of a non-leaf tensor
tmp2 = F.threshold(tmp1, 0., 0., True)  # inplace ReLU applied to that view
prob_interpolated = torch.sigmoid(tmp2)

gradients = grad(outputs=prob_interpolated, inputs=inputs,
                 grad_outputs=torch.ones(prob_interpolated.size()),
                 create_graph=True, retain_graph=True)[0]

gradient_penalty = gradients.sum()

# Debug graph
torchviz.make_dot(gradient_penalty).view()
gradient_penalty.backward()

The computational graph generated is:

The interesting part is the branch on the right that links ThresholdBackwardBackward directly to ThresholdBackward, even though ThresholdBackward is already wrapped inside the first CopySlices.

The thing is that part of the threshold_ function code is:

  baseType->threshold_(self_, threshold, value);
  increment_version(self);
  rebase_history(flatten_tensor_args( self ), grad_fn);
  if (tracer_state) {
    jit::tracer::setTracingState(std::move(tracer_state));
    jit::tracer::addOutput(node, self);
  }
  if (grad_fn) {
    grad_fn->result_ = SavedVariable(self, true);
  }

As you can see, self is saved as an output of grad_fn. So when ThresholdBackward is called to generate ThresholdBackwardBackward, self is associated with ThresholdBackward, hence the graph above.

The thing is that after the rebase_history call, self is no longer an output of grad_fn; it is an output of the rewritten graph.
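At the Python level you can see the history being rewritten when an in-place op is applied to a view (a rough sketch; the exact node names printed vary between versions):

import torch

x = torch.ones(4, requires_grad=True)
v = (x + 1).view(2, 2)  # a view of a non-leaf tensor
print(v.grad_fn)        # a view-type backward node at this point
torch.relu_(v)          # the in-place op rebases the view's history
print(v.grad_fn)        # no longer the plain relu/threshold backward node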

Changing the save to

grad_fn->result_ = SavedVariable(self, !as_variable_ref(self).is_view());

This change makes sure that, when self's history is rewritten, we no longer consider it an output of grad_fn.

After the fix in the PR, the new graph is as expected:

Hi @albanD !

Actually, I'm not familiar with the Torch C++ code. I understand std::move :slight_smile: but not rebase_history, jit::tracer, SavedVariable, or setTracingState. It is also not clear to me why there should be no ThresholdBackward before ThresholdBackwardBackward, because I don't know how the backward pass is supposed to be carried out. That's why I asked for links to a detailed description of the backward pass.

I think your comments will be very useful for people who are deep into the Torch code, but not for regular users.

I trust that you fixed it :+1:. I also understand that I can use inplace=False to get correct code at the cost of slightly higher memory usage. Thank you!

Hi,

Yes, this comment was more to explain the details of the issue without writing something too long on the GitHub PR.

You can either use inplace=False, or use the ReLU version I gave above with inplace=True.