Encountering the RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation

Does this still hold as of 2019? I am reassigning variables many times in my code.

Another question: can we use with torch.autograd.set_detect_anomaly(True): inside JIT modules? Is that still expected to work?

One hint: I solved it by switching to a non-in-place ReLU layer, rather than by replacing all the +=s. This was in JIT mode.
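For reference, here is a minimal sketch of both options (the Block module, its conv layer, and the residual x are made-up names for illustration): using an out-of-place ReLU, and replacing += with a plain +.

import torch.nn as nn

# In-place versions that can trip autograd:
#   self.relu = nn.ReLU(inplace=True)
#   out += x

# Out-of-place equivalents:
class Block(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)
        self.relu = nn.ReLU(inplace=False)   # out-of-place ReLU

    def forward(self, x):
        out = self.relu(self.conv(x))
        out = out + x                        # new tensor instead of out += x
        return out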

I am new to PyTorch. While running a model from GitHub, I encountered a similar error:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [1, 1024, 1, 1]] is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

I have tried adding inplace=True to nn.PReLU(), but it didn’t work. Any suggestions?

Traceback (most recent call last):
  File "train.py", line 90, in <module>
    g_loss.backward()
  File "/Users/zxiao/opt/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/Users/zxiao/opt/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag

Here is the code snippet:

with torch.autograd.set_detect_anomaly(True):
    netG.zero_grad()
    g_loss = generator_criterion(fake_out, fake_img, real_img)
    g_loss.backward()

    fake_img = netG(z)
    fake_out = netD(fake_img).mean()
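For reference, the usual fix in this kind of GAN loop is to recompute the generator forward pass after the discriminator has been updated, so that g_loss.backward() does not run through a graph that was built before optimizerD.step() modified netD's parameters in place. A minimal sketch, assuming a standard SRGAN-style setup (optimizerD, optimizerG, the d_loss formula, and z are assumed names, not taken from the snippet above):

fake_img = netG(z)
d_loss = 1 - netD(real_img).mean() + netD(fake_img.detach()).mean()
netD.zero_grad()
d_loss.backward()
optimizerD.step()                       # in-place update of netD's parameters

fake_img = netG(z)                      # fresh forward pass after the D update
fake_out = netD(fake_img).mean()
g_loss = generator_criterion(fake_out, fake_img, real_img)
netG.zero_grad()
g_loss.backward()
optimizerG.step()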

Seems like some sort of bug in my case. When I call loss.backward() in my LSTM training module in a Jupyter notebook, it works perfectly. When I copy it over to Colab, I get the issue described above.

I have the same problem with a BottleneckLSTM.
The LSTM cell code looks like:

class BottleneckLSTMCell(nn.Module):
	""" Creates a LSTM layer cell
	Arguments:
		input_channels : variable used to contain value of number of channels in input
		hidden_channels : variable used to contain value of number of channels in the hidden state of LSTM cell
	"""
	def __init__(self, input_channels, hidden_channels):
		super(BottleneckLSTMCell, self).__init__()

		assert hidden_channels % 2 == 0

		self.input_channels = int(input_channels)
		self.hidden_channels = int(hidden_channels)
		self.num_features = 4
		self.W = nn.Conv2d(in_channels=self.input_channels, out_channels=self.input_channels, kernel_size=3, groups=self.input_channels, stride=1, padding=1)
		self.Wy  = nn.Conv2d(int(self.input_channels+self.hidden_channels), self.hidden_channels, kernel_size=1)
		self.Wi  = nn.Conv2d(self.hidden_channels, self.hidden_channels, 3, 1, 1, groups=self.hidden_channels, bias=False)  
		self.Wbi = nn.Conv2d(self.hidden_channels, self.hidden_channels, 1, 1, 0, bias=False)
		self.Wbf = nn.Conv2d(self.hidden_channels, self.hidden_channels, 1, 1, 0, bias=False)
		self.Wbc = nn.Conv2d(self.hidden_channels, self.hidden_channels, 1, 1, 0, bias=False)
		self.Wbo = nn.Conv2d(self.hidden_channels, self.hidden_channels, 1, 1, 0, bias=False)
		self.relu = nn.ReLU6()
		self.Wci = None
		self.Wcf = None
		self.Wco = None
		self._initialize_weights()
	def forward(self, x, h, c): # implemented as described in the paper; the only difference is that Wbi, Wbf, Wbc & Wbo are computed all together in the paper
		"""
		Arguments:
			x : input tensor
			h : hidden state tensor
			c : cell state tensor
		Returns:
			output tensor after LSTM cell 
		"""
		x = self.W(x)
		y = torch.cat((x, h),1) #concatenate input and hidden layers
		i = self.Wy(y) #reduce to hidden layer size
		b = self.Wi(i)	#depth wise 3*3
		ci = torch.sigmoid(self.Wbi(b) + c * self.Wci)
		cf = torch.sigmoid(self.Wbf(b) + c * self.Wcf)
		cc = cf * c + ci * self.relu(self.Wbc(b))
		co = torch.sigmoid(self.Wbo(b) + c * self.Wco)
		ch = co * self.relu(cc)
		return ch, cc

And the network code looks like:

class MyNet(nn.Module):
    def __init__(self, width_mult = 1.0, batch_size = 1):
        super(MyNet, self).__init__()
        self._mob = MobUNet(width_mult=width_mult)
        hidden_channels = self._mob.final_predict_ch
        self._lstm = BottleneckLSTMCell(
            input_channels=self._mob.final_predict_ch,
            hidden_channels=self._mob.final_predict_ch,)
        
        self._distribution = nn.Sequential(
            nn.Conv2d(hidden_channels, 32, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 16, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 16, kernel_size=1, stride=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 2, kernel_size=1, stride=1)
        )
        self._map = nn.Sequential(
            nn.Conv2d(hidden_channels, 32, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 16, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 16, kernel_size=1, stride=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 6, kernel_size=1, stride=1)
        )
        self.init_weights(self._distribution.modules())
        self.init_weights(self._map.modules())
        self.lstmh,self.lstmc = self._lstm.init_hidden(batch_size,self._mob.final_predict_ch,(320,320))

    def forward(self,x):
        f = self._mob(x)
        self.lstmh,self.lstmc = self._lstm(f,self.lstmh,self.lstmc)
        score = self._distribution(self.lstmh)
        pred_map = self._map(self.lstmh)

        return torch.cat((score,pred_map),1),self.lstmh

The training process looks like:

    net = MyNet()
    net = net.to('cuda')
    net.init_state()
    opt = optim.Adam(net.parameters(), lr=0.001)
    y = torch.zeros((1,2,320,320)).float().to('cuda')
    torch.autograd.set_detect_anomaly(True)
    for i in range(20):
        x = torch.randint(0,255,(1,640,640,3)).float().to('cuda').permute(0,3,1,2)
        opt.zero_grad()
        pred,_ = net(x)
        loss = torch.mean(torch.abs(pred[:,:2]-y))
        loss.backward(retain_graph=True)
        opt.step()

Then, the second time backward is called, it says:

one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1, 52, 320, 320]] is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

And I traced back and found out:
in f = self._mob(x), the versions of x and f are both 1
then the versions of self.lstmh and self.lstmc at the 1st forward are 1
but self._lstm(f, self.lstmh, self.lstmc) returns 2 tensors with version 0
inside self._lstm.forward,
x = self.W(x) returns a version-0 tensor
but the versions of self.W and x are both 1
I also tried renaming x = self.W(x) to xnew = self.W(x), but it did not help.
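One workaround I am considering (not sure if it is correct) is to detach the stored hidden state at the start of every iteration and drop retain_graph=True, so that each step builds a fresh graph instead of backpropagating into the previous step's graph, whose parameters have already been modified by opt.step(). A hypothetical variant of the loop above (net.lstmh / net.lstmc follow the attribute names in MyNet; adjust if init_state() stores them differently):

    for i in range(20):
        x = torch.randint(0, 255, (1, 640, 640, 3)).float().to('cuda').permute(0, 3, 1, 2)
        net.lstmh = net.lstmh.detach()    # cut the link to the previous iteration's graph
        net.lstmc = net.lstmc.detach()
        opt.zero_grad()
        pred, _ = net(x)
        loss = torch.mean(torch.abs(pred[:, :2] - y))
        loss.backward()                   # no retain_graph needed once the state is detached
        opt.step()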

Any suggestions?

I am facing this error

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: 
[torch.FloatTensor [5, 6]], which is output 0 of TBackward, is at version 2; expected version 1 instead. 
Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

and I can’t figure out the solution after reading many articles. The code in which this error occurs is:

model.py

from config import *

class Lambda(nn.Module):

    def __init__(self, operation):
        super().__init__()
        self.operation = operation

    def forward(self, x):
        return self.operation(x)

class memory(nn.Module):

    def __init__(self,
                 gamma=0.95, 
                 entry=20, 
                 entry_element=5,  
                 n=10,
                 classes = 10, 
                 input_shape = (28,28,1)):
        
        super().__init__()
        global memory_size
        global batch_size
        self.gamma = torch.tensor([gamma],  requires_grad = False)
        self.entry = entry
        self.entry_element = entry_element
        #alpha = tf.Variable(np.random.randint(1), trainable = True, dtype = tf.float32)
        #gate_param = tf.sigmoid(alpha)
        self.n = n
        self.no_of_parameters = entry_element + 1
        self.no_of_classes = classes
        self.input_shape = input_shape
        self.tanh = nn.Tanh()
        self.softmax = nn.Softmax(dim = 1)
        self.classification_softmax = nn.Softmax(dim = 1)
        self.sigmoid = nn.Sigmoid()
        self.einsum = Lambda(lambda a: torch.einsum('abc,abd->adc',a[0], a[1])) #abc(128,20,5), abd(128,20,1)
        self.reduce_sum = Lambda(lambda x: torch.sum(x, dim=1, keepdim=True))
        self.keras_multiply = Lambda(lambda xy : torch.einsum('abc,def->abf',xy[0], xy[1]))
        # mask = Lambda(lambda x : torch.slice(tf.sort(x, axis=1, direction='ASCENDING', name=None), begin=[0, n, 0], size=[-1, 1, 1]))
        self.mask = Lambda(lambda xn : torch.sort(xn[0],1)[0][:,-xn[1],:].clone())
        self.greater = Lambda(lambda jk : torch.greater(jk[1],jk[0].tile((1,memory_size[0],1))))
        self.controller = nn.LSTM(self.input_shape[-2]*self.input_shape[0]* batch_size,
                                  self.entry_element,
                                  1) #for inserting just 12 LSTM Layer
        self.key_dense = nn.Linear(self.entry_element,self.no_of_parameters)
        self.classification_layer = nn.Linear(self.entry_element,self.no_of_classes)
        

    def forward(self, inputs, state):
        
        i = torch.squeeze(inputs)
        f = torch.flatten(i,start_dim = 1)
        print(f"The shape of f is {f.shape}")
        inp = torch.reshape(f, (1, 1,-1))
        # print(f"The shape of the inputs is {inputs.to(dtype= torch.float32).dtype}")
        out, (h_n, c_n) = self.controller(inp.float())
        # print(f"The types of LSTM outputs are {type(out)} and {type(h_n)} and {type(c_n)}")
        out_1 = torch.tanh_(out)
        print(f"the shape of out is {out_1.shape}")
        out_2 = self.key_dense(out_1)
        p = out_2[:,:,:self.entry_element]
        key = p
        gate_param = torch.squeeze(torch.sigmoid_(out_2[:,:,-1].clone()))
        # gate_param.squeeze_()

        #writing
        # print(gate_param)
        w_w = torch.add(torch.multiply(gate_param.clone(), state['w_r'].clone()), 
                        torch.multiply((1-gate_param), state['w_lu'].clone()))
        
        print(f"ther shape of w_w is {w_w.shape}")
        print(f"The shape of key is {key.shape}")
        write = self.keras_multiply([w_w, key])
        print(state['M'].clone().shape)
        print((write).shape)
        M = torch.add(state['M'].clone(), write)

        #reading
        
        #M_dot_kt =  dot([tile(M, kt])
        print(f"The shape of M is {M.shape} and key is {key.shape}")
        M_dot_kt = torch.matmul(M, torch.squeeze(key)) #(128,20)
        '''The matmul function does the dot product of 3D tensors'''
        M_dot_kt = torch.unsqueeze(M_dot_kt, dim = -1)
        print(f"The shape and type of the M_dot_kt is {M_dot_kt.shape} and {M_dot_kt.dtype} ")
        w_r = self.softmax(M_dot_kt)
        #w_r = M_dot_kt
        
        r_t = self.einsum([M, w_r])
        print(f"The shape of r_t recieved is {r_t.shape}")

        #least used related computation
        # print(f"The shape of gamma is {self.gamma.shape} and w_u is {state['w_u'].shape}")
        print(self.gamma)
        gamma_w_u = torch.multiply(state['w_u'].clone(),self.gamma) #(128,20,1)
        
        w_u = torch.add(torch.add(gamma_w_u, w_r), w_w)
        masked = self.mask([w_u,self.n])
        tile_masked = torch.tile(masked, (1,self.entry))
        tile_masked.unsqueeze_(-1)
        print(f"MAsked shape is {tile_masked.shape}")
        w_lu = torch.greater(w_u, tile_masked)

        states = [r_t, w_r, w_lu, w_u, M]
        # state_w_r = w_r
        # state_w_lu = w_lu
        # state_w_u = w_u
        # state_m = M    
        '''
        next_states = {
            'read_vector': states[0], 
            'w_r': states[1],
            'w_lu': states[2],
            'w_u': states[3],
            'M': states[4],
        }
        '''
        flattened = torch.flatten(r_t, start_dim = 1)
        print(f"The shape of the flattened varaible is {flattened.shape}")
        flattened_output = self.classification_layer(flattened)
        print("The shape of the output is ",flattened_output.shape)
        # output = torch.reshape(flattened_output, (batch_size,self.no_of_classes))
        pred_class = self.classification_softmax(flattened_output)
        output = pred_class

        return {'read_vector': states[0],
                'w_r': states[1],
                'w_lu': states[2], 
                'w_u': states[3], 
                'M': states[4]}, output


    def zero_state(self, batch_size):
        # requires_grad defaults to False, so plain factory calls are enough here
        one_hot_weight_vector = torch.rand([batch_size, self.entry, 1])
        one_hot_weight_vector[..., 0] = 1

        state = {
            'read_vector': torch.rand([batch_size, 1, self.entry_element]),
            'w_r': one_hot_weight_vector,
            'w_lu': one_hot_weight_vector,
            'w_u': one_hot_weight_vector,
            'M': torch.ones([batch_size,
                             self.entry,
                             self.entry_element], dtype=torch.float32) * 1e-6
        }
        return state

    

main.py

from config import *
from model import memory
from preprocessing import data_batched, data_batched_test


def train(model, device, train_loader, optimizer, epoch):
    global batch_size
    # model.train()
    state = model.zero_state(batch_size)
    for batch_idx, (data, target) in enumerate(train_loader):
        print(f"The batch_idx value is {batch_idx}")
        data, target = data.to(device), target.to(device)
        
        next_state,output = model(data, state)
        loss = nn.CrossEntropyLoss()(output, target).clone()
        # torch.autograd.set_detect_anomaly(True)
        loss.backward(retain_graph = True)
        optimizer.step() 
        optimizer.zero_grad()
        state = next_state
        if batch_idx % 32 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))

def test(args, model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item() # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True) # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)

    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))


def main(save_model, epochs):
    # Training settings

    global data_batched_test
    global data_batched
    global memory
    global batch_size
    model = memory()
    # state = model.zero_state(batch_size)
    device = torch.device("cpu")

    optimizer = optim.Adam(model.parameters(), lr = 0.000001)
    print(model.parameters())
    for epoch in range(1, epochs + 1):
        train(model, device, data_batched, optimizer, epoch)
        # test(model, device, data_batched_test)

    if save_model == True:
        torch.save(model.state_dict(),"mnist_cnn.pt")
       
if __name__ == '__main__':
    main(True, 10)
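In case it is relevant, one pattern that is often suggested for recurrent state carried across batches is to detach everything in state before reusing it and to drop retain_graph=True, so the next backward() does not reach into a graph whose parameters have already been stepped. A hypothetical variant of the train() function above with that change:

def train(model, device, train_loader, optimizer, epoch):
    global batch_size
    state = model.zero_state(batch_size)
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        next_state, output = model(data, state)
        loss = nn.CrossEntropyLoss()(output, target)
        optimizer.zero_grad()
        loss.backward()                      # no retain_graph once the state is detached
        optimizer.step()
        # detach the carried-over state so the next batch starts a fresh graph
        state = {k: v.detach() if torch.is_tensor(v) else v for k, v in next_state.items()}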

My Model

class Model(T.nn.Module):

    def __init__(self,n_class):
        super(Model,self).__init__()

        self.encoder = Encoder()
        self.decoder = Decoder(1)
        self.prior   = Prior()
        self.embeder = Embed(n_class)
        self.n_class = n_class
        self.los_fnc = ELBOLoss()  
        self.rep     = Reparamatrize()
    def forward(self,inputs):
        x,y = inputs
        loss = 0
        P = T.zeros_like(x)
        for i in range(self.n_class):
            C = T.zeros_like(P)
            C[...,i] = x[...,i]
            embed = self.embeder([x,C,P])
            mu1,var1 = self.encoder([y[...,i].view(-1,1),embed])
            mu2,var2 = self.prior(embed)
            z = self.rep([mu1,var1])
            dec = self.decoder([embed,z])
            loss = loss+ self.los_fnc([mu1,var1,mu2,var2,dec,y[...,i].view(-1,1)])
            q = T.distributions.Poisson(T.abs(dec))
            n = q.sample()
            P[...,i] = n.view(-1,1)[...,0]
            P = T.multiply(P,x)
        return  loss

Training Step

T.autograd.set_detect_anomaly(True)
def Train(model, Data ,optim, epochs = 20):
    for ep in range(epochs):
        x,y = Data
        optim.zero_grad()

        loss = model([T.tensor(x).float(),T.tensor(y).float()])
        loss.backward()
        optim.step()

        print(f'Epoch[{ep+1}/{epochs}] || Loss: {loss}')
    print('Finished Training')

modell = Model(6)
optim = T.optim.Adam(modell.parameters(), lr=0.0001)
Train(modell,[label_set,ground_truth],optim=optim)

Error


RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [12800, 6]] is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
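One thing I suspect (though I am not sure) is the in-place write P[..., i] = ... in forward(), since P has already been passed to self.embeder earlier in the same loop. A hypothetical out-of-place version of that update would be:

cols = list(P.unbind(dim=-1))      # current columns of P
cols[i] = n.view(-1, 1)[..., 0]    # replace column i without touching the old P
P = T.stack(cols, dim=-1)          # fresh tensor, so saved versions stay valid
P = P * x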

Please Help
Thanks

In case this helps anyone:

It took me forever to find out that the following does NOT work (this is a network with split outputs / split heads):

# loss/zero_grad/backwards/.step() 1
actor_loss = -self.action_choice_distribution.log_prob(self.action_with_gradient_tracking)*advantage.detach()
self.adam_actor.zero_grad()
actor_loss.backward()
self.adam_actor.step()
self.logging.accumulated_actor_loss += actor_loss.item()

# loss/zero_grad/backwards/.step() 2
critic_loss = advantage.pow(2).mean()
self.adam_critic.zero_grad()
critic_loss.backward()
self.adam_critic.step()
self.logging.accumulated_critic_loss += critic_loss.item()

But this does work:

actor_loss = -self.action_choice_distribution.log_prob(self.action_with_gradient_tracking)*advantage.clone().detach()
critic_loss = advantage.pow(2).mean()
self.adam_actor.zero_grad()
self.adam_critic.zero_grad()
actor_loss.backward()
critic_loss.backward()
self.adam_actor.step()
self.adam_critic.step()
self.logging.accumulated_actor_loss += actor_loss.item()
self.logging.accumulated_critic_loss += critic_loss.item()

retain_graph=True also did not change/affect this error.
It would be really nice if there were a better way to debug this, as the in-place operation was not obvious at all.
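For what it's worth, the in-place operation in cases like this is often the optimizer step itself: .step() updates the parameters in place, so any backward() that still needs those parameters afterwards sees a newer version than the one saved in the graph. A tiny self-contained repro (the private ._version counter is used only for illustration):

import torch

w = torch.randn(3, 3, requires_grad=True)
x = torch.randn(4, 3, requires_grad=True)
out = (x @ w).sum()                  # the graph saves w for the backward pass
print(w._version)                    # 0
with torch.no_grad():
    w.add_(1.0)                      # in-place update, like an optimizer step
print(w._version)                    # 1 -> "is at version 1; expected version 0"
out.backward()                       # raises the RuntimeError discussed in this thread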


Hi @jeff-hykin,
What exactly are adam_actor and adam_critic?
I'm currently running into this issue: my model has 2 separate Linear sub-models, and I'm using one SGD optimizer to which I pass all of my model's parameters (like here: pytorch-examples/rnn.py at master · python-engineer/pytorch-examples · GitHub).
Could you please share a bit more about your model setup to help me understand better?

Hi.
I hope your problem got solved. I had this problem, and solutions like using clone() did not work for me. But when I installed PyTorch version 1.4, it was solved.
I think this problem is some kind of bug in the step() function. The weird thing is that this bug happens when you use PyTorch version 1.5, but not in v1.4.
You can see all released versions of PyTorch in this link.

Great, thanks for the illustration. This helped me solve my further bugs as well.

Oh! Thank you so much, I've been stuck here for so long.

Thanks! Your reply helped me! But I'm curious why this works.