RuntimeError during gradient computation

I ran into a problem while implementing my loss function.
(Specifically, I want to implement maximum likelihood estimation, so I have to keep the forward graph around and add new forward results to it every time.)
I get a RuntimeError about an in-place operation, which I suspect is the assignment to the parameters.
Has anyone encountered the same problem and solved it?

The following code reproduces the error with torch version 0.6.1.
The RuntimeError is raised on the second optimizer.step().

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

class Network(nn.Module):
    def __init__(self, dim, hidden_size):
        super(Network, self).__init__()
        self.fc1 = nn.Linear(dim, hidden_size)
        self.activate = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, 1)
    def forward(self, x):
        return self.fc2(self.activate(self.fc1(x)))
func = Network(3, 10).cuda()
optimizer = optim.SGD(func.parameters(), lr=1e-2)

context_1 = np.array([0.1,0.1,0.1])
context_2 = np.array([0.2,0.2,0.2])
tensor_1 = torch.from_numpy(context_1).float().cuda()
tensor_2 = torch.from_numpy(context_2).float().cuda()
a = func(tensor_1)
b = func(tensor_2)

# first update
loss = a*a
func.zero_grad()
loss.backward(retain_graph=True)
optimizer.step()

# second update: RuntimeError raised here
loss = loss + b*b
func.zero_grad()
loss.backward(retain_graph=True)
optimizer.step()

Thanks!

Hi,

The problem I see here is that the first optimizer.step() modifies the parameters of your network in place, but the original values of these parameters are needed to compute the second backward.
Hence the error that you see.
You will have to either delay the optimizer step or redo the forward, I’m afraid.
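
For the second option, something like this should work (a minimal sketch, reusing func, optimizer, tensor_1 and tensor_2 from your snippet): recompute the outputs after each step so the graph always refers to the current parameter values.

# first update
a = func(tensor_1)
loss = a * a
func.zero_grad()
loss.backward()
optimizer.step()

# redo both forwards with the updated parameters before building the second loss
a = func(tensor_1)
b = func(tensor_2)
loss = a * a + b * b
func.zero_grad()
loss.backward()
optimizer.step()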

Thanks for helping. It works after redoing the forward.
Is there any way to avoid redoing the forward? Because of the maximum likelihood estimation approach, if I have to redo all the forward passes every time, the computation time will grow linearly.

If you don’t want to redo the forward, you will have to delay the optimizer step.
You may be able to stash the current gradients into a temporary list (cloning them) for each backward you do. Then you can do all the steps once all the gradients are computed.
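
Something along these lines (a rough sketch, again reusing the names from your snippet; untested): since no optimizer.step() happens between the backwards, the retained graph stays valid, and the parameters are only modified at the end.

stashed_grads = []  # one list of cloned gradients per backward

a = func(tensor_1)
b = func(tensor_2)

loss = a * a
func.zero_grad()
loss.backward(retain_graph=True)
stashed_grads.append([p.grad.clone() for p in func.parameters()])

loss = loss + b * b
func.zero_grad()
loss.backward()
stashed_grads.append([p.grad.clone() for p in func.parameters()])

# only now touch the parameters: restore each stashed gradient set and step
for grads in stashed_grads:
    for p, g in zip(func.parameters(), grads):
        p.grad = g
    optimizer.step()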