RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [1, 6]]

Hi,
I encountered this problem and haven’t been able to sort it out for a long while now. Everything works fine until the backward pass, where it gives me this error message:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [1, 6]], which is output 0 of SliceBackward, is at version 30; expected version 27 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

Here’s the code:

import torch

# running [1,6] vector inputs to 3 NN models
torch.autograd.set_detect_anomaly(True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
#training
k=0
TrainData0=Mynodein[0,0:1,:].clone().detach().requires_grad_(True)
TrainData1=Mynodein[0,1:2,:].clone().detach().requires_grad_(True)
TrainData2=Mynodein[0,2:3,:].clone().detach().requires_grad_(True)

for epoch in range(50):
    
    epoch_lossN0 = 0
    epoch_lossN1 = 0
    epoch_lossN2 = 0
    lgcount=0
    #for i in range(k*Tbatch,k*Tbatch+Tbatch):
    for i in range(1):
        N0_weight = model.T0_forward(TrainData0)
        N1_weight = model.T1_forward(TrainData1)
        N2_weight = model.T2_forward(TrainData2)
        
        diff0=t[:,i]+TimeOfFlight_Matrix[0,:]-t[0,i]
        #diff0=t[:,time_idx].clone()-t[0,time_idx].clone()  #without propagation delay
        diff0_cat=torch.cat([diff0[0:0], diff0[0+1:]])
        prod0=torch.sum(diff0_cat* N0_weight)
        t0=t[0,i]+Tnode[0] + prod0  #evaluates the next timestamp for node 0, i.e., t[0,i+1]
        t[0,i+1]=t0
    
        diff1=t[:,i]+TimeOfFlight_Matrix[1,:]-t[1,i]
        #diff1=t[:,time_idx].clone()-t[1,time_idx].clone()  #without propagation delay
        diff1_cat=torch.cat([diff1[0:1], diff1[1+1:]])
        prod1=torch.sum(diff1_cat* N1_weight)
        t1=t[1,i]+Tnode[1] + prod1  #evaluates the next timestamp for node 1, i.e., t[1,i+1]
        t[1,i+1]=t1
    
        diff2=t[:,i]+TimeOfFlight_Matrix[2,:]-t[2,i]
        #diff2=t[:,time_idx].clone()-t[2,time_idx].clone()  #without propagation delay
        diff2_cat=torch.cat([diff2[0:2], diff2[2+1:]])
        prod2=torch.sum(diff2_cat* N2_weight)
        t2=t[2,i]+Tnode[2] + prod2  #evaluates the next timestamp for node 2, i.e., t[2,i+1]
        t[2,i+1]=t2
        
        timevec=[t0,t1,t2]
           
        lgtensor=torch.tensor(2+lgcount)
        
        local_lossN0= torch.log(lgtensor)*lossfn0(timevec)
        local_lossN1= torch.log(lgtensor)*lossfn1(timevec)
        local_lossN2= torch.log(lgtensor)*lossfn2(timevec)
        
     
    optimizer.zero_grad()
    local_lossN0.backward(retain_graph=True)
    local_lossN1.backward(retain_graph=True)
    local_lossN2.backward(retain_graph=True)
        
    optimizer.step()

I don’t know what I am doing wrong.

I don’t know where the tensor t is coming from, but based on your code I would guess that these in-place operations are causing the issue:

t[0,i+1]=t0
t[1,i+1]=t1
t[2,i+1]=t2

Replace these calls with:

t = torch.stack((t0, t1, t2))

(or torch.cat, depending on the shapes of the tensors) and it might work.
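
For example, here is a rough sketch of how the update could look without the in-place writes (assuming t has shape [3, num_steps] and only column i+1 is being filled in; the names just mirror your snippet):

next_col = torch.stack((t0, t1, t2)).reshape(3, 1)  # new column, still attached to the graph through t0, t1, t2
t = torch.cat((t[:, :i+1], next_col), dim=1)        # rebuild t instead of assigning t[k, i+1] in place

This keeps every previously stored entry of t untouched, so autograd never sees a version bump on a tensor it saved for the backward pass.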

Also, creating a new tensor via torch.tensor() will detach it from the computation graph, so replace:

lgtensor=torch.tensor(2+lgcount)

with:

lgtensor = 2 + lgcount

assuming lgcount is a tensor attached to a computation graph.
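
A quick way to see the difference (toy example, assuming lgcount is a tensor with requires_grad=True):

lgcount = torch.ones(1, requires_grad=True)
detached = torch.tensor(2 + lgcount)  # copies the value into a new leaf tensor; grad_fn is None, the graph is cut
attached = 2 + lgcount                # plain arithmetic keeps a grad_fn, so gradients still flow back to lgcount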

Thanks for your help, I was able to fix the issue. You were right, it was the leaf tensor t: recreating the tensor at each epoch detaches it from the previous computation graph, and I have been able to get it working now.

On another note, I am now plotting the loss curve and it appears to be declining as expected; however, the curve does not seem to converge. Here is a figure of the learning curve.

Do you have any suggestions?

I don’t know which loss function you are using or what the expected ranges are. Is a starting loss of ~4e-5 expected? That seems quite low.

Hi, I am using an MSE loss, and I was expecting the starting loss to be around 1e-4 to 1e-6.

Hi, thanks, I think it’s sorted now. I just needed to adjust the learning rate, and I added momentum to help it converge faster.
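
For reference, it was essentially this kind of change (the lr and momentum values here are just placeholders, not the exact ones I used):

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)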

Here is the new result: