Encountering an in-place operation error in delayed online learning

Joseph_Kao · April 26, 2021, 6:57am

I’m working on a delayed online learning scheme where I want to run the optimization step after a certain delay, and I’m hitting an in-place operation error when I slice tensors. Here is rough pseudocode of what I’m doing.
Let E be the number of episodes, T the time horizon of each episode, and D the delay (in episodes).

input=torch.zeros(E,T)
output=torch.zeros(E,T)
output_target=torch.zeros(E,T)
for e in range(E):
   for t in range(T):
      input[e,t] = ...  # read from the outer environment
      output[e,t] = model( input[e,t] )  # model is my neural net, an nn.Module
      # act in the outer environment with output[e,t]
      output_target[e,t] = ...  # read from the outer environment
   if e>=D:
      loss = 0.0
      for t in range(T):
         loss += loss_function( output_target[e-D,t] , output[e-D,t] )
      optimizer.zero_grad()
      loss.backward(retain_graph=True)
      optimizer.step()

I get the in-place operation error: “RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation:”
It seems to happen because I keep filling slices of output with new results, so the version of output keeps changing. What is an efficient way to program what I want to do?

Thanks!

Joseph_Kao · April 27, 2021, 8:55am

Here is a minimal working example that produces the error:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt

class DNN(nn.Module):
    def __init__(self,dim):
        super(DNN,self).__init__()
        self.fc=nn.Sequential(
            nn.Linear(dim,dim),
            nn.Tanh(),
            nn.Linear(dim,dim)
        )
    def forward(self,data):
        return self.fc(data)

def main():
    E=20
    T=20
    D=5
    dim=3
    model=DNN(dim)
    optimizer=torch.optim.Adam(model.parameters())
    loss_fn=nn.MSELoss()
    data_in=torch.zeros(E,T,dim)
    data_out=torch.zeros(E,T,dim)
    data_target=torch.zeros(E,T,dim)
    for e in range(E):
        for t in range(T):
            data_in[e,t]=torch.from_numpy(np.random.rand(dim))
            data_out[e,t]=model(data_in[e,t])
            data_target[e,t]=torch.from_numpy(np.random.rand(dim))
        if e>=D:
            loss=0.0
            for t in range(T):
                loss += loss_fn(data_target[e-D,t],data_out[e-D,t])
            optimizer.zero_grad()
            loss.backward(retain_graph=True)
            optimizer.step()
    
if __name__=='__main__':
    main()

Error message:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [1, 3]] is at version 120; expected version 119 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

The idea is that I want to train the model on old samples, one at a time, while generating new samples, i.e. in an online manner. In particular, at episode e, I want to train the model with the sample (output and target) from episode (e-D) while generating the new output of episode e. I have no idea how to do this other than by slicing, but slicing (I believe) causes the in-place operation error.

ptrblck · April 27, 2021, 9:35am

Could you remove the inplace modification of the loss tensor and use:

loss = loss + loss_fn(...)

instead?
Based on the posted code snippet it seems to be the first place I would start to debug the issue.
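
To make the difference behind this suggestion concrete, here is a small sketch that is not taken from the posted code. The names `acc` and `same_object` are made up for illustration, and `_version` is an internal attribute used here only for inspection. It shows that `+=` on a tensor updates the tensor in place, while `x = x + ...` binds the name to a new tensor.

```python
import torch

acc = torch.zeros(3)
same_object = acc                 # second reference to the same tensor
version_before = acc._version     # internal in-place modification counter

acc += 1.0                        # in-place: same object, counter goes up
inplace_kept_identity = acc is same_object
inplace_bumped_version = acc._version > version_before

acc = acc + 1.0                   # out-of-place: the name now points to a new tensor
out_of_place_new_object = acc is not same_object

print(inplace_kept_identity, inplace_bumped_version, out_of_place_new_object)
```

Whether an in-place update actually breaks backward depends on whether autograd saved the modified tensor, so this change removes one possible cause rather than guaranteeing a fix.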

Joseph_Kao · April 27, 2021, 5:51pm

Hello @ptrblck , thank you for your reply. But after changing to “loss=loss+loss_fn(data_target[e-D,t],data_out[e-D,t])” I just got the same error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [1, 3]] is at version 120; expected version 119 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

Joseph_Kao · April 28, 2021, 12:35am

Updated code with the modification mentioned above that still produces the same error:

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

class DNN(nn.Module):
    def __init__(self,dim):
        super(DNN,self).__init__()
        self.fc=nn.Sequential(
            nn.Linear(dim,dim),
            nn.Tanh(),
            nn.Linear(dim,dim)
        )
    def forward(self,data):
        return self.fc(data)

def main():
    E=20
    T=20
    D=5
    dim=3
    model=DNN(dim)
    optimizer=torch.optim.Adam(model.parameters())
    loss_fn=nn.MSELoss()
    data_in=torch.zeros(E,T,dim)
    data_out=torch.zeros(E,T,dim)
    data_target=torch.zeros(E,T,dim)
    for e in range(E):
        for t in range(T):
            data_in[e,t]=torch.from_numpy(np.random.rand(dim))
            data_out[e,t]=model(data_in[e,t])
            data_target[e,t]=torch.from_numpy(np.random.rand(dim))
        if e>=D:
            loss=0.0
            for t in range(T):
                loss=loss+loss_fn(data_target[e-D,t],data_out[e-D,t])
            optimizer.zero_grad()
            loss.backward(retain_graph=True)
            optimizer.step()

if __name__=='__main__':
    main()

tom · April 28, 2021, 8:30am

Several things are wrong here.
One source of error is that the model needs the unmodified data_in for the backward pass, but the versioning does not capture regions of data_in: autograd tracks one version per tensor, so writing to any slice of data_in marks the whole tensor as modified. Depending on your setup, you could either use a (nested) list of tensors or generate the inputs outside the loop.
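
A small sketch of the versioning issue described above, not taken from the posted code. The names `buffer`, `row_view`, and `layer` are made up for illustration, and `_version` is an internal attribute used here only for inspection. It assumes a CPU tensor and a plain nn.Linear: a write to a different row of the same big tensor is enough to invalidate the input that autograd saved.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(3, 3)
buffer = torch.zeros(4, 3)           # one big tensor, filled row by row

buffer[0] = torch.rand(3)
row_view = buffer[0]                 # a view; shares its version counter with buffer
out = layer(row_view)                # autograd saves the input for the weight gradient

version_at_forward = row_view._version
buffer[1] = torch.rand(3)            # in-place write to a different row
version_changed = row_view._version != version_at_forward

try:
    out.sum().backward()
    raised = False
except RuntimeError:
    raised = True                    # the saved input is "at version N+1"

# With a clone, the saved input is no longer tied to buffer.
out_ok = layer(buffer[0].clone())
buffer[2] = torch.rand(3)
out_ok.sum().backward()              # succeeds

print(version_changed, raised)
```

Storing each step's tensor as its own list entry avoids the shared counter in the same way the clone does.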

If you fix this ad hoc by using model(data_in[e,t].clone()), you will get another error from backpropagation, this time involving the weights, which is triggered when you backprop inside the loop. I would venture that retain_graph=True is an error here (it almost universally is, unless you know and can articulate the precise reason why it is not), and you would want to structure your computation to separate the forward and backward passes.
This is my general advice: Don’t keep stuff from the previous training (= forward + backward + optimization) step unless you have a good reason. Personally, I really try to articulate with some precision why things need to be carried over.
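
The second error mentioned above can be sketched as follows. This is an illustration under assumptions rather than the posted code: it uses SGD and random inputs, and the names `old_out` and `stale_backward_failed` are made up. The graph of `old_out` still refers to the weights that `optimizer.step()` later changes in place.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(3, 3), nn.Tanh(), nn.Linear(3, 3))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

old_out = model(torch.rand(1, 3))    # graph built with the current weights

# A full training step on other data changes the weights in place.
optimizer.zero_grad()
model(torch.rand(1, 3)).sum().backward()
optimizer.step()

try:
    old_out.sum().backward()         # needs the pre-step weights of the last layer
    stale_backward_failed = False
except RuntimeError:
    stale_backward_failed = True

print(stale_backward_failed)
```

Passing retain_graph=True does not help here, because the graph is still there; it is the weights it depends on that have changed.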

Best regards

Thomas

Joseph_Kao · May 6, 2021, 9:29am

Hi @tom , it took me some time to digest what you wrote, especially what you meant by another error from backprop inside the loop.

But my question remains. What I’m asking is exactly how to program this when forward + backward + optimization can’t be done in one step; the good reason is an application requirement. In online learning I have to generate a good prediction (hence the forward pass) at each step; in the problem I consider, the backward and optimization parts have to be delayed for application reasons.

Let me give a concrete flow of what I want. Let’s assume for simplicity that the delay is D=2. I’ll use In[s] for the input of step s, Out[s] for the output of step s, and M_i for the i-th version of the model.

1. The initial model is M_0
2. Generate Out[1] = M_0(In[1])
3. Generate Out[2] = M_0(In[2])
4. Generate Out[3] = M_0(In[3])
5. Backward and optimize with Out[1] and In[1] to generate M_1
6. Generate Out[4] = M_1(In[4])
7. Backward and optimize with Out[2] and In[2] to generate M_2
8. Generate Out[5] = M_2(In[5])
…etc

Note that in my case, deep-copying M_0 and feeding it In[1] will not necessarily reproduce Out[1], because the procedure contains randomness; Out[1] is basically not reproducible.

What is the correct way to program this then?

Thanks in advance!

tom · May 6, 2021, 3:39pm

This is a decidedly difficult pattern. Would it be possible to move taking the gradient (but not the optimizer step) to where the output is generated? That would make it much easier because you only need to keep the gradients around, not the computational graph. The problem here is that you can only take gradients between the forward pass and the next change to the weights (so step 7 would fail because step 5 changed the weights from what they were in step 3). If this isn’t an option, you could probably replace all the modules so that they call .clone() on all parameters before using them. This is a bit tedious because you need to re-implement/subclass all the nn.Modules you are using, but it would be feasible.

So the other thing we touched upon is that for this kind of pattern, you probably want to use a list to store the input and output tensors, not a single large tensor you index into.
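
A minimal sketch of the first suggestion above (backward at forward time, optimizer step later), under stated assumptions. Random tensors stand in for the outer environment, the target of an episode is assumed to be available by the end of that episode, and the names `run_delayed_training`, `pending_grads`, and `steps_taken` are made up for illustration. It keeps a single model and a single optimizer and stores only gradient copies, one list per episode.

```python
from collections import deque

import torch
import torch.nn as nn


def run_delayed_training(E=12, T=5, D=3, dim=4, seed=0):
    """While running episode e, apply the gradient computed in episode e-D."""
    torch.manual_seed(seed)
    model = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))
    optimizer = torch.optim.Adam(model.parameters())
    loss_fn = nn.MSELoss()
    pending_grads = deque()          # one list of gradient tensors per episode
    steps_taken = 0

    for e in range(E):
        # Forward pass for the whole episode with the current weights.
        # Random tensors replace the outer environment (assumption).
        loss = 0.0
        for t in range(T):
            x = torch.rand(dim)
            target = torch.rand(dim)
            loss = loss + loss_fn(model(x), target)

        # Backward right away, while the weights still match the graph.
        model.zero_grad()
        loss.backward()
        pending_grads.append([p.grad.detach().clone() for p in model.parameters()])

        # Delayed update: use the gradient stored D episodes ago.
        if len(pending_grads) > D:
            old_grads = pending_grads.popleft()
            for p, g in zip(model.parameters(), old_grads):
                p.grad = g
            optimizer.step()
            steps_taken += 1

    return model, steps_taken


model, steps_taken = run_delayed_training()
print(steps_taken)
```

The applied gradients were computed at older weights, which is inherent to the delay. Because only gradient copies are kept, no graph has to survive an optimizer step and retain_graph=True is not needed.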

Best regards

Thomas

Joseph_Kao · May 7, 2021, 9:41am

Hi @tom, thank you so much for your reply. Yes, taking the gradient first is totally fine for my application. I have revised my (actual) code based on your suggestions, and it is now working great.

This is how the minimal working example should look based on the suggestions:

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

class DNN(nn.Module):
    def __init__(self,dim):
        super(DNN,self).__init__()
        self.fc=nn.Sequential(
            nn.Linear(dim,dim),
            nn.Tanh(),
            nn.Linear(dim,dim)
        )
    def forward(self,data):
        return self.fc(data)

def main():
    E=50
    T=20
    D=5
    dim=4
    model=DNN(dim)
    optimizer=torch.optim.Adam(model.parameters())
    loss_fn=nn.MSELoss()
    data_in=[[torch.zeros(dim) for t in range(T)] for e in range(E)]
    data_out=[[torch.zeros(dim) for t in range(T)] for e in range(E)]
    data_target=[[torch.zeros(dim) for t in range(T)] for e in range(E)]
    optimizer_list=[None for e in range(E)]
    for e in range(E):
        for t in range(T):
            data_in[e][t]=torch.from_numpy(np.random.rand(dim)).float()
            data_out[e][t]=model(data_in[e][t])
            data_target[e][t]=torch.from_numpy(np.random.rand(dim)).float()
        loss=0.0
        for t in range(T):
            loss=loss_fn(data_target[e][t],data_out[e][t])
        optimizer_list[e]=torch.optim.Adam(model.parameters())
        optimizer_list[e].zero_grad()
        loss.backward()
        if e>=D:
            optimizer_list[e-D].step()
    
if __name__=='__main__':
    main()

I have been struggling with this for several weeks; you really saved my life, @tom. Thanks again!

Joseph_Kao · May 17, 2021, 7:29am

The code from my last post is wrong (it doesn’t work as described). Keeping a list of optimizers is useless; we should instead keep a list (actually two lists) of models and copy the gradients. The following is the correct code.

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

class DNN(nn.Module):
    def __init__(self,dim):
        super(DNN,self).__init__()
        self.fc=nn.Sequential(
            nn.Linear(dim,dim),
            nn.Tanh(),
            nn.Linear(dim,dim)
        )
    def forward(self,data):
        return self.fc(data)

def main():
    E=50
    T=20
    D=5
    dim=4
    model=DNN(dim)
    optimizer=torch.optim.Adam(model.parameters())
    loss_fn=nn.MSELoss()
    data_in=[[torch.zeros(dim) for t in range(T)] for e in range(E)]
    data_out=[[torch.zeros(dim) for t in range(T)] for e in range(E)]
    data_target=[[torch.zeros(dim) for t in range(T)] for e in range(E)]
    model_list=[None for e in range(E)]
    model_update_list=[None for e in range(E)]
    for e in range(E):
        if e==0:
            model_list[e]=DNN(dim)
        else:
            model_list[e]=type(model_update_list[e-1])(dim)
            model_list[e].load_state_dict(model_update_list[e-1].state_dict())
        for t in range(T):
            data_in[e][t]=torch.from_numpy(np.random.rand(dim)).float()
            data_out[e][t]=model_list[e](data_in[e][t])
            data_target[e][t]=torch.from_numpy(np.random.rand(dim)).float()
        # Accumulate the loss over the whole episode
        loss=0.0
        for t in range(T):
            loss=loss+loss_fn(data_target[e][t],data_out[e][t])
        # Backward now, while model_list[e] still holds the weights used in the forward pass
        loss.backward()
        # Copy the weights (state_dict holds no gradients) into the model that receives the update
        model_update_list[e]=type(model_list[e])(dim)
        model_update_list[e].load_state_dict(model_list[e].state_dict())
        optimizer_update=torch.optim.Adam(model_update_list[e].parameters())
        if e>=D:
            # Apply the gradients that were computed D episodes ago
            for p1,p2 in zip(model_update_list[e].parameters(),model_list[e-D].parameters()):
                p1.grad=p2.grad
            optimizer_update.step()
    
if __name__=='__main__':
    main()