RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [3, 1]], which is output 0 of TanhBackward, is at version 1; expected version 0 instead

I cannot debug the code as it’s not executable, but you should check these operations:

t[i,time_idx+1]=t[i,time_idx]+Time_res+ torch.sum(torch.dot(self.network_output[i,:], diff))

sumerror[i]=torch.log(torch.tensor(2+time_idx))*error

Nodein[b,:]=Node

as they are all assigning tensors in place; try to replace them with a new tensor creation, if possible.
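For example, instead of writing each node's new clock time into t in place, you could collect the values in a Python list and build a fresh tensor with torch.stack. A rough, untested sketch based only on the lines quoted above (not your full code):

new_times = []
for i in range(N):
    diff = t[:, time_idx] - t[i, time_idx]
    diff = diff[diff != 0]
    new_times.append(t[i, time_idx] + Time_res + torch.dot(self.network_output[i, :], diff))
new_time = torch.stack(new_times)  # fresh tensor instead of the in-place write t[i, time_idx+1] = ...

The same list-then-stack pattern would apply to sumerror and Nodein.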

Thank you very much for the response. I did try changing the functions above as suggested, but still nothing changed.

Here is the complete executable code:

import os
os.environ['TF_CPP_MIN_LOG_LEVEL']="2"
import numpy as np # linear algebra
import random
from numpy import newaxis
from numpy import array
import torch
from torch import nn
torch.pi = torch.acos(torch.zeros(1)).item() * 2 # which is 3.1415927410125732  
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
import math
import array
from sklearn.metrics import confusion_matrix 
plt.style.use('fivethirtyeight')
print ('import completed')



#initializing the power time and distance
global N
global t
global Time_res
N=5
t=np.zeros((N,201), dtype=float)
t=torch.Tensor(t)
Tx_powers=np.ones((N,1), dtype=int)
AttExp = 2              # Attenuation Exponent (2 = free space)
AttConst = 1000**2       # Attenuation coefficient (e.g., Aeff)
X_max= 20
Y_max=20
Time_res=1/200
location_X=X_max*np.random.rand(N,1)
location_Y=Y_max*np.random.rand(N,1)
Distance_matrix=np.zeros((N,N), dtype=float)
P0= 0.01

#Allocating the different nodes randomly
for i in range(N):
    for j in range(i+1, N):
        Distance_matrix[i,j]=1000*np.sqrt(np.square(location_X[j]-location_X[i])+np.square(location_Y[j]-location_Y[i]))
        Distance_matrix[j,i]=Distance_matrix[i,j]
        
#Generating RSSI matrix, i.e. Pr
#Row index = receiver, Column index = transmitter
RSSI_Matrix = np.zeros((N,N),dtype=float)
for i in range(N):
    for j in range(N):
        if i != j :     # If ii=jj the RSSI is infinite
            RSSI_Matrix[i,j] = Tx_powers[j]*AttConst/Distance_matrix[i,j]**AttExp
        elif i == j:
            RSSI_Matrix[i,j] = 0
            
# Initializing nodes to random clocks. 
# t_i(0) is the initial offset of user % i. It is uniform over 1/200 sec
# Time resolution is 1/200 of a sec. We count timing in unit of T_res = 1/200 sec

for i in range (N):
    t[i,0]= torch.rand(1,1)*Time_res
    
    
plt.scatter(location_X, location_Y, label='WSN Layout', marker='o')

Data_pow=pd.DataFrame(data=RSSI_Matrix, index=None, columns=None, dtype=None, copy=None)
print(Data_pow.to_string())

Data_time=pd.DataFrame(data=t, index=None, columns=None, dtype=None, copy=None)
print(Data_time.loc[0:N,0].to_string())

# Get cpu or gpu device for training.
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using {} device".format(device))



# Defining the Network model
class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        """this is the neural network 
        it can take a shape of 9 columns by any column
        the output layer is 2
        """
        self.model = nn.Sequential(
            nn.Linear(3*N,N-1),
            nn.Sigmoid(),
            nn.Linear(N-1,N-1),
            nn.Sigmoid(),
            nn.Linear(N-1,N-1),
            nn.Softmax(),
        )

    def extra_repr(self, time_idx):
        return time_idx
    
    def forward(self, x, time_idx):
        
        #x is input data that you want to pass into the neural network
        # self.network_output is network output
        #t[:,time_idx+1] is the new clock time for the next time index (Eqn. 16) gotten from the weights (i.e., softmax o/p)
        self.network_output = self.model(x)
        
        for i in range(N):
            diff=t[:,time_idx]-t[i,time_idx]
            diff=diff[diff!=0]
            t[i,time_idx+1]=t[i,time_idx]+Time_res+ torch.sum(torch.dot(self.network_output[i,:], diff))
        self.newtime=t[:,time_idx+1]
        
        return self.network_output, self.newtime
    
    def custom_loss(self, t_outvec, time_idx):
        sumerror=torch.zeros((N,1), dtype=float)
        for i in range(N):
            ee=t_outvec-t_outvec[i]
            error= torch.sum(torch.square(ee))
            sumerror[i]=torch.log(torch.tensor(2+time_idx))*error
        self.loss=torch.sum(sumerror)
        return self.loss
modelmy=NeuralNetwork().to(device)

# Setting the input Matrix to the NN
Nodein=torch.zeros(N,3*N)
#Node=np.empty()
for b in range(N):
    Node=torch.cat([t[:,0], torch.Tensor(Distance_matrix[b,:]),torch.Tensor(RSSI_Matrix[b,:])])
    Nodein[b,:]=Node

#Running the code for 20 timesteps
torch.autograd.set_detect_anomaly(True)
optimizer = torch.optim.SGD(modelmy.parameters(), lr=0.01)
steps = 20
for i in range(steps):
    output = modelmy.forward(Nodein,i)
    loss=modelmy.custom_loss(output[1],i)
    print(loss)
    optimizer.zero_grad()
    loss.backward(retain_graph=True)
    optimizer.step()

My guess is there is a problem with the customized loss function; I just can't point out what exactly it is. Thanks in advance for your help.

The new code still seems to use all of the previously mentioned in-place operations, so I would expect it to fail as well.

My bad, sorry for the back and forth. I tried using the .clone() function to sort out the issue with the in-place operations. However, it still gave the same error. I am not sure of another way to get this done. Is there another route different from .clone()?

Here is the edited NN code below:

import os
os.environ['TF_CPP_MIN_LOG_LEVEL']="2"
import numpy as np # linear algebra
import random
from numpy import newaxis
from numpy import array
import torch
from torch import nn
torch.pi = torch.acos(torch.zeros(1)).item() * 2 # which is 3.1415927410125732  
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
import math
import array
from sklearn.metrics import confusion_matrix 
plt.style.use('fivethirtyeight')
print ('import completed')



#initializing the power time and distance
global N
global t
global Time_res
N=5
t=np.zeros((N,201), dtype=float)
t=torch.Tensor(t)
Tx_powers=np.ones((N,1), dtype=int)
AttExp = 2              # Attenuation Exponent (2 = free space)
AttConst = 1000**2       # Attenuation coefficient (e.g., Aeff)
X_max= 20
Y_max=20
Time_res=torch.tensor(1/200)
location_X=X_max*np.random.rand(N,1)
location_Y=Y_max*np.random.rand(N,1)
Distance_matrix=np.zeros((N,N), dtype=float)
P0= 0.01

#Allocating the different nodes randomly
for i in range(N):
    for j in range(i+1, N):
        Distance_matrix[i,j]=1000*np.sqrt(np.square(location_X[j]-location_X[i])+np.square(location_Y[j]-location_Y[i]))
        Distance_matrix[j,i]=Distance_matrix[i,j]
        
#Generating RSSI matrix, i.e. Pr
#Row index = receiver, Column index = transmitter
RSSI_Matrix = np.zeros((N,N),dtype=float)
for i in range(N):
    for j in range(N):
        if i != j :     # If ii=jj the RSSI is infinite
            RSSI_Matrix[i,j] = Tx_powers[j]*AttConst/Distance_matrix[i,j]**AttExp
        elif i == j:
            RSSI_Matrix[i,j] = 0
            
# Initializing nodes to random clocks. 
# t_i(0) is the initial offset of user % i. It is uniform over 1/200 sec
# Time resolution is 1/200 of a sec. We count timing in unit of T_res = 1/200 sec

for i in range (N):
    t[i,0]= torch.rand(1,1)*Time_res
    
    
plt.scatter(location_X, location_Y, label='WSN Layout', marker='o')

Data_pow=pd.DataFrame(data=RSSI_Matrix, index=None, columns=None, dtype=None, copy=None)
print(Data_pow.to_string())

Data_time=pd.DataFrame(data=t, index=None, columns=None, dtype=None, copy=None)
print(Data_time.loc[0:N,0].to_string())
print(t[:,0])


# Get cpu or gpu device for training.
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using {} device".format(device))



# Defining the Network model
class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        """this is the neural network 
        it can take a shape of 9 columns by any column
        the output layer is 2
        """
        self.model = nn.Sequential(
            nn.Linear(3*N,N-1),
            nn.Sigmoid(),
            nn.Linear(N-1,N-1),
            nn.Sigmoid(),
            nn.Linear(N-1,N-1),
            nn.Softmax(dim=1),
        )

   
    def forward(self, x, time_idx):
        
        #x is input data that you want to pass into the neural network
        # self.network_output is network output
        #t[:,time_idx+1] is the new clock time for the next time index (Eqn. 16) gotten from the weights (i.e., softmax o/p)
        self.network_output = self.model(x)
        
        for i in range(N):
            diff=t[:,time_idx].clone()-t[i,time_idx].clone()
            diff=diff[diff!=0].clone()
            t[i,time_idx+1]=t[i,time_idx].clone()+Time_res.clone() + torch.sum(torch.dot(self.network_output[i,:].clone(), diff.clone()))
        self.newtime=t[:,time_idx+1].clone()
        
        sumerror=torch.zeros((N,1), dtype=float)
        mse_loss = nn.MSELoss()
        for i in range(N):
            for j in [a for a in range(N) if a != i]:
                sumerror[i]=sumerror[i].clone()+mse_loss(self.newtime[i].clone(),self.newtime[j].clone())
        
        self.loss=torch.sum(sumerror.clone())   
                
        return self.network_output, self.newtime, self.loss
    
modelmy=NeuralNetwork().to(device)


# Setting the input Matrix to the NN
Nodein=torch.zeros(N,3*N)
for b in range(N):
    Node=torch.cat([t[:,0].clone(), torch.Tensor(Distance_matrix[b,:]).clone(),torch.Tensor(RSSI_Matrix[b,:]).clone()])
    Nodein[b,:]=Node.clone()
print(Nodein)


#RUNNING THE CODE
torch.autograd.set_detect_anomaly(True)
optimizer = torch.optim.SGD(modelmy.parameters(), lr=0.01)
steps = 20
for i in range(steps):
    output = modelmy.forward(Nodein,i)
    #loss=modelmy.custom_loss(output[1],i)
    loss=modelmy.loss
    print(loss)
    optimizer.zero_grad()
    loss.backward(retain_graph=True)
    optimizer.step()

Not sure, but I am guessing optimizer.step() is modifying the parameters in place.

I am reasonably sure that it doesn’t, because it is a PyTorch method.

Parameters are updated in place by optimizers, as seen here.
@Raphael_Emeka if you think this might be the issue, you might be running into this issue and would need to check your workflow to make sure no stale forward activations are used.


Ok, thanks, I will check it out.

Would making a .clone() of the tensor make sense to resolve this error?

Yes, cloning the tensor that is disallowed to be changed in place should work.


Thanks @ptrblck. When is a tensor disallowed to be changed in place?

In-place operations are disallowed if the tensor is needed to calculate the gradients during the backward pass, as e.g. shown in this example.
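A minimal, self-contained illustration of that failure mode (and the clone-based fix), matching the TanhBackward error at the top of this thread:

import torch

x = torch.randn(3, 1, requires_grad=True)

# failing version: tanh's backward needs its output, which gets modified in place
y = torch.tanh(x)
y[0] = 0.0            # in-place write bumps y's version counter
y.sum().backward()    # RuntimeError: ... output 0 of TanhBackward ... is at version 1

# working version: clone first and modify the clone; the tanh output stays untouched
z = torch.tanh(x).clone()
z[0] = 0.0
z.sum().backward()    # works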


I am trying the SAM optimizer for multiple methods and I am facing this error with both ways of calling backward() multiple times.

# -----------------SAM Optimizer -------------------
        
        criterion(models['backbone'](inputs)[0], labels)
        loss.backward(retain_graph=True)
        optimizers['backbone'].first_step(zero_grad=True)
        
        criterion(models['backbone'](inputs)[0], labels)
        loss.backward(retain_graph=True)
        optimizers['backbone'].second_step(zero_grad=True)

        # -----------------SAM Optimizer for LLOSS Method -------------------
        if method == 'lloss':
            #optimizers['module'].step()
            loss1 = criterion(models['backbone'](inputs)[0], labels)
            loss1.backward()
            optimizers['module'].first_step(zero_grad=True)
            
            loss2 = criterion(models['backbone'](inputs)[0], labels)
            loss2.backward()
            optimizers['module'].second_step(zero_grad=True)

            loss = torch.tensor([loss1, loss2])
            loss.backward(gradient=torch.tensor([1.0,1.0]))

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [512, 100]], which is output 0 of AsStridedBackward0, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

I don’t know how exactly the optimizer works internally, but if first_step() is already updating the parameters used to calculate loss2, the loss2.backward() call would fail, since the forward activations are stale and the parameters were already updated in place.

This is taken from GitHub - davda54/sam: SAM: Sharpness-Aware Minimization (PyTorch).
You are right, I fixed the first one, but the second method is still giving an error.

#---------------Definition of LLOSS------------
if method == 'lloss':
                base_optimizer = torch.optim.SGD   
                optim_module   = SAM(models['module'].parameters(),  base_optimizer, lr=LR, 
                    momentum=MOMENTUM, weight_decay=WDECAY)
                sched_module   = lr_scheduler.MultiStepLR(optim_module, milestones=MILESTONES)
                optimizers = {'backbone': optim_backbone, 'module': optim_module}
                
                schedulers = {'backbone': sched_backbone, 'module': sched_module} 
            
# -----------------SAM Optimizer -------------------
        
        criterion(models['backbone'](inputs)[0], labels)
        loss.backward(retain_graph=True)
        optimizers['backbone'].first_step(zero_grad=True)
        
        criterion(models['backbone'](inputs)[0], labels)
        optimizers['backbone'].second_step(zero_grad=True)

        # -----------------SAM Optimizer for LLOSS Method -------------------
        if method == 'lloss':
            #optimizers['module'].step()
            criterion(models['backbone'](inputs)[0], labels)
            loss.backward(retain_graph=True)
            optimizers['module'].first_step(zero_grad=True)
            
            criterion(models['backbone'](inputs)[0], labels)
            optimizers['module'].second_step(zero_grad=True)

ERROR

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [512, 100]], which is output 0 of AsStridedBackward0, is at version 3; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

Using retain_graph=True will not fix the issue, as it would only keep the intermediate forward activations alive. The main issue might still be the same: the step() method updates parameters which would be needed for the next backward call.
In this case you could either recalculate the forward pass to create the forward activations using the already updated parameters, or update the parameters after all gradients were computed (see the sketch below for the first option).
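A rough sketch of that first option for the SAM case, reusing the names from your snippet and the first_step/second_step API from the linked repo (illustrative only): each loss is recomputed with a fresh forward pass after the parameters have been updated, so no stale activations are reused.

# first forward/backward pass: gradients at the current parameters
loss = criterion(models['backbone'](inputs)[0], labels)
loss.backward()
optimizers['backbone'].first_step(zero_grad=True)   # updates the parameters in place

# second forward/backward pass: recompute the loss with the updated parameters
# instead of calling backward() again on the stale graph
loss = criterion(models['backbone'](inputs)[0], labels)
loss.backward()
optimizers['backbone'].second_step(zero_grad=True)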

Thank you so much for your comments. Can you share any related tutorials with me?

I don’t know if there is a good tutorial, but this code snippet shows why this approach is mathematically wrong:

# setup
import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(10, 10),
    nn.ReLU(),
    nn.Linear(10, 10)
)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# forward pass
x = torch.randn(1, 10)
out = model(x)

# loss calculation
loss = criterion(out, torch.rand_like(out))

# gradient calculation using the intermediate forward activations from the 
# previous forward pass (a0) and the current parameter set (p0)
loss.backward(retain_graph=True)

# update parameters to new set p1
optimizer.step()

# gradient calculation using the stale activations (a0) and the new parameter
# set p1, which will not work as it's mathematically wrong
loss.backward()
# RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [10, 10]], which is output 0 of AsStridedBackward0, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
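For completeness, a minimal sketch of the consistent alternative (same setup as above, starting again from the forward pass): recompute the forward pass after the update, so the second backward uses activations that match the current parameters.

target = torch.rand_like(out)

# first gradient calculation: activations a0 with parameters p0
loss = criterion(out, target)
loss.backward()
optimizer.step()            # parameters are now p1

# recompute the activations with p1 before the second backward pass
out = model(x)
loss = criterion(out, target)
loss.backward()             # a1 with p1 -> consistent, no error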

Hi, I’m facing a similar issue:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [128, 10000]], which is output 0 of SoftmaxBackward0, is at version 10; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

#Snippet-1 is the head part of the model
class HeadPart(nn.Module):

    def __init__(self, num_features, num_classes):
        
        super().__init__()
        self.num_classes = tuple(num_classes)
        for num_class in self.num_classes:
            assert num_class > 0

        self.heads = nn.ModuleList(
            # [nn.Linear(num_features, num_class) for num_class in self.num_classes]  # Line-A
            [nn.Sequential(nn.Linear(num_features, num_class), nn.Softmax(1)) for num_class in self.num_classes]  # Line-B
            
        )

    def forward(self, x):
        return [head(x) for head in self.heads]

The code was working fine! But when I 1) updated the output layer values of the model (according to the requirements of the task, but with the same shape) and 2) replaced line-A with line-B, the above-mentioned runtime error occurred at line-C of snippet-2 below.

#Snippet-2
class NativeScalerWithGradNormCount:
    state_dict_key = "amp_scaler"

    def __init__(self):
        self._scaler = torch.cuda.amp.GradScaler()

    def __call__(
        self,
        loss,
        optimizer,
        clip_grad=None,
        parameters=None,
        create_graph=False,
        update_grad=True,
    ):
        self._scaler.scale(loss).backward(create_graph=create_graph)  # Line-C
        if update_grad:
            if clip_grad is not None:
                assert parameters is not None
                self._scaler.unscale_(
                    optimizer
                )  # unscale the gradients of optimizer's assigned params in-place
                norm = torch.nn.utils.clip_grad_norm_(parameters, clip_grad)
            else:
                self._scaler.unscale_(optimizer)
                norm = ampscaler_get_grad_norm(parameters)
            self._scaler.step(optimizer)
            self._scaler.update()
        else:
            norm = None
        return norm

# snippet-3
def updated_output(hierarchy, outputs): 

    for level in range (len(hierarchy)):
        for p in hierarchy[level].keys():
            outputs[level+1][:, hierarchy[level][p]]=(outputs[level+1][:, hierarchy[level][p]]).mul_(outputs[level][:, [p]])
        outputs[level+1]=normalize_output(outputs[level+1])

    return outputs

A few observations:

  1. The code works with line-A of snippet-1, even though I update the output layer values of the model (snippet-3).
  2. The code works with line-B of snippet-1 only if I don't update the output layer values of the model (snippet-3).
  3. But if I use line-B of snippet-1 and also update the output layer values of the model (snippet-3), then the above error occurs.

What I want is to use line-B of snippet-1 while also updating the output values.

If the error/issue (the in-place operation) is in snippet-3, what is an efficient way to resolve it?

Need help to resolve this.
Thanks!!

Hi @John5
Not sure I clearly understand what is going on in your code. When you say lines 1 and 2, you mean A and B, right?
So adding a softmax and doing in-place modifications of your output causes an error that is not raised when you don't have it, is that right?

I can’t really tell you what’s going on without a further look at the code (we don’t see where the last snippet is used, or what the shape of the tensor is, etc.), but in general you should avoid in-place modifications of your tensors when you’re using autograd. In the last snippet I would create an empty tensor (torch.empty_like(tensor)) and populate it like you do in the loop, rather than modifying the input tensor in place.
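Applied to snippet-3, that could look roughly like this (untested sketch, reusing your names and assuming the column groups in hierarchy[level] don't overlap):

def updated_output(hierarchy, outputs):
    for level in range(len(hierarchy)):
        new_out = torch.empty_like(outputs[level + 1])   # fresh tensor; the softmax output is left untouched
        new_out[:] = outputs[level + 1]
        for p in hierarchy[level].keys():
            cols = hierarchy[level][p]
            new_out[:, cols] = outputs[level + 1][:, cols] * outputs[level][:, [p]]
        outputs[level + 1] = normalize_output(new_out)
    return outputs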
Hope that helps!

Thank you @vmoens for your reply!!!
Yes, lines 1 and 2 mean lines A and B (now corrected).
Using torch.empty_like(tensor) solves my problem. Thanks!!!
