"RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time" while using custom loss function


(Phan Duc Viet) #1

I keep running into this error:

RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

I’ve searched the forum, but I still can’t figure out what’s wrong with my custom loss function.
I’m using nn.GRU; here is my loss function:

def _loss(outputs, session, items):  # `items` is a dict containing the embedding of every item
    def f(output, target):
        pos = torch.from_numpy(np.array([items[target["click"]]])).float()
        neg = torch.from_numpy(np.array([items[idx] for idx in target["suggest_list"] if idx != target["click"]])).float()
        if USE_CUDA:
            pos, neg = pos.cuda(), neg.cuda()
        pos, neg = Variable(pos), Variable(neg)

        pos = F.cosine_similarity(output, pos)
        if neg.size()[0] == 0:
            return torch.mean(F.logsigmoid(pos))
        neg = F.cosine_similarity(output.expand_as(neg), neg)

        return torch.mean(F.logsigmoid(pos - neg))

    loss = list(map(f, outputs, session))  # wrap in list() so torch.cat gets a sequence, not a map iterator
    return -torch.mean(torch.cat(loss))

Training code:

        # zero the parameter gradients
        model.zero_grad()

        # forward + backward + optimize
        outputs, hidden = model(inputs, hidden)
        loss = _loss(outputs, session, items)
        acc_loss += loss.data[0]

        loss.backward()
        # Manual SGD step: subtract each parameter's gradient, scaled by the learning rate
        for p in model.parameters():
            p.data.add_(-learning_rate, p.grad.data)

(jpeg729) #2

I don’t think the error is in your loss function. I think any loss function would trigger this error.

Am I right in saying that your training loop doesn’t detach or repackage the hidden state in between batches? If so, then loss.backward() is trying to back-propagate all the way through to the start of time, which works for the first batch but not for the second because the graph for the first batch has been discarded.

If I am right then there are two possible solutions.

  1. detach/repackage the hidden state in between batches. There are (at least) three ways to do this (a sketch showing this in context follows the list).

    1. hidden.detach_()
    2. hidden = hidden.detach()
    3. hidden = Variable(hidden.data, requires_grad=True)
  2. replace loss.backward() with loss.backward(retain_graph=True) but know that each successive batch will take more time than the previous one because it will have to back-propagate all the way through to the start of the first batch.
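To make the first option concrete, here is a minimal, self-contained sketch (PyTorch ≥ 0.4 style, so plain tensors instead of Variable; the GRU sizes, the random inputs, the toy loss and the manual SGD step are placeholders for your own model, _loss and update):

import torch
import torch.nn as nn

gru = nn.GRU(input_size=8, hidden_size=16, batch_first=True)
learning_rate = 0.01
hidden = torch.zeros(1, 4, 16)  # (num_layers, batch, hidden_size)

for step in range(3):  # pretend these are successive batches
    inputs = torch.randn(4, 5, 8)  # (batch, seq_len, input_size)

    # Cut the graph here so backward() only runs through the current batch
    # instead of reaching back through every previous one.
    hidden = hidden.detach()

    gru.zero_grad()
    outputs, hidden = gru(inputs, hidden)
    loss = outputs.mean()  # stand-in for your custom loss
    loss.backward()        # no retain_graph needed now

    for p in gru.parameters():
        p.data.add_(-learning_rate * p.grad.data)

If you comment out the hidden = hidden.detach() line, the second iteration raises exactly the “Trying to backward through the graph a second time” error, because the first batch’s graph has already been freed.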


(Phan Duc Viet) #3

Thanks. I had tried the 2nd solution before, but it ran really slowly. Your 1st solution fixed it for me.


(Thong Nguyen) #4

@jpeg729, can you elaborate a bit more:

Mathematically speaking, what is the difference between solution 1 and solution 2? Or are they mathematically equivalent but not equally efficient computationally?

Thanks


(Yongjie Shi) #5

I have the same question. Thank you!