Cannot find where the in-place operation error is

ntuce002 · May 13, 2022, 2:32pm

I’m trying to train the MADDPG model; however, it raises an in-place operation error.
Here’s the traceback, and I’ve taken some excerpts that I think are critical:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [16, 2]], which is output 0 of AsStridedBackward0, is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

[W …\torch\csrc\autograd\python_anomaly_mode.cpp:104] Warning: Error detected in AddmmBackward0. Traceback of forward call that caused the error:

c_loss, a_loss = all_agents.learn(memory)
File "C:\Users\chhuang\new_dens_speed_offset\maddpg.py", line 107, in learn
pi = self.agents[idx - 1].actor(car_dens_state['%s' % idx], scooter_dens_state['%s' % idx],
File "C:\Users\chhuang\anaconda3_1\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\chhuang\new_dens_speed_offset\networkv4.py", line 73, in forward
xyzw = self.mu(xyzw)

The traceback points to networkv4.py, line 73.
That line, xyzw = self.mu(xyzw), is in the forward() method of the Actor class. The full code of the Actor class is below:

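import os

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


class Actor(nn.Module):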
    def __init__(self, alpha, name, chkpt_dir):
        super(Actor, self).__init__()
        
        self.chkpt_file = os.path.join(chkpt_dir, name)

        self.fc1 = nn.Linear(4, 16)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, 32)
        self.fc4 = nn.Linear(32, 16)
        
        self.bn1 = nn.LayerNorm(16)
        self.bn2 = nn.LayerNorm(64)
        self.bn3 = nn.LayerNorm(32)
        self.bn4 = nn.LayerNorm(16)
        
        self.mu = nn.Linear(16, 2)
        
        self.optimizer = optim.Adam(self.parameters(), lr=alpha)
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.to(self.device)

    def forward(self, car_D, scooter_D, car_S, scooter_S):
        torch.autograd.set_detect_anomaly(True)
        x = self.fc1(car_D)
        x = self.bn1(x)
        x = F.relu(x).flatten(1)
        
        y = self.fc1(scooter_D)
        y = self.bn1(y)
        y = F.relu(y).flatten(1)
        
        z = self.fc1(car_S)
        z = self.bn1(z)
        z = F.relu(z).flatten(1)
        
        w = self.fc1(scooter_S)
        w = self.bn1(w)
        w = F.relu(w).flatten(1)
        
        xyzw = torch.cat([x, y, z, w], dim=1)
        xyzw = self.fc2(xyzw)
        xyzw = self.bn2(xyzw)
        xyzw = F.relu(xyzw)
        
        xyzw = self.fc3(xyzw)
        xyzw = self.bn3(xyzw)
        xyzw = F.relu(xyzw)
        
        xyzw = self.fc4(xyzw)
        xyzw = self.bn4(xyzw)
        xyzw = F.relu(xyzw)
        xyzw = self.mu(xyzw)
        xyzw = torch.sigmoid(xyzw)
        return xyzw

And here’s the link to the whole code of maddpg.py. It is a little bit ugly, so I uploaded it to GitHub instead of posting it here.

I’ve been stuck on this problem all day and still can’t find where the bug is. I hope someone can give me some direction on how to modify my code properly.

ptrblck · May 15, 2022, 11:00pm

I guess the issue might be raised by using

critic_loss.backward(retain_graph=True)
...
actor_loss.backward(retain_graph=True)
...
self.agents[idx - 1].actor.optimizer.step()
self.agents[idx - 1].update_network_parameters()

Using retain_graph=True won’t release the computation graph, which often yields these types of errors (e.g. if the parameters were already updated and thus the forward activations are stale, as described here).
Could you explain why retain_graph=True is used?
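
A minimal sketch of how this failure mode can arise, assuming a toy actor and critic that are each a single nn.Linear layer. The names actor, critic, stale_update and fresh_update are hypothetical and not taken from maddpg.py. Stepping the critic’s optimizer between the two backward() calls modifies the critic weights in place, so the retained graph still refers to a weight tensor that autograd saved at an earlier version:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy networks, standing in for the real actor and critic.
actor = nn.Linear(4, 2)
critic = nn.Linear(2, 1)
actor_opt = torch.optim.SGD(actor.parameters(), lr=0.1)
critic_opt = torch.optim.SGD(critic.parameters(), lr=0.1)

state = torch.randn(16, 4)


def stale_update():
    """Reuse one retained graph across an optimizer step.

    Returns the error message if autograd complains, otherwise None.
    """
    action = actor(state)
    q = critic(action)
    critic_loss = (q - 1.0).pow(2).mean()
    actor_loss = -q.mean()

    critic_opt.zero_grad()
    critic_loss.backward(retain_graph=True)
    critic_opt.step()  # updates critic.weight in place

    actor_opt.zero_grad()
    try:
        # The gradient w.r.t. `action` needs the critic weight that was
        # saved during the forward pass, but that tensor has since changed.
        actor_loss.backward()
    except RuntimeError as err:
        return str(err)
    return None


def fresh_update():
    """Run a separate forward pass per loss, so no graph has to be retained."""
    # Critic update: detach the action so this graph stops at the critic.
    q = critic(actor(state).detach())
    critic_loss = (q - 1.0).pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: new forward pass through the already-updated critic.
    actor_loss = -critic(actor(state)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return actor_loss.item()


error_message = stale_update()
print("stale_update raised:", error_message is not None)
loss_value = fresh_update()
print("fresh_update finished, actor loss:", loss_value)
```

Under these assumptions, fresh_update avoids the error because each backward() call only walks a graph that was built after the most recent optimizer step, so retain_graph=True is not needed there. Whether the same restructuring fits the MADDPG code in question is not something this sketch can confirm.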

zanga · September 13, 2022, 3:39pm

Hello @ptrblck, I know this is a little outdated but I came across the same error.
The GitHub link no longer works, so I cannot see the whole code, but if I’m guessing right, during learning @ntuce002 has to make multiple passes through the graph (one for each agent), so it needs to be retained.

Maybe I’m totally wrong, and my understanding of the mechanisms behind it may be as well, but how can it work if we don’t retain the graph?

Thank you.

ptrblck · September 13, 2022, 6:13pm

I don’t know enough anymore about this use case and was speculating what might be causing the issue. Unfortunately, @ntuce002 didn’t follow up so I don’t know if my guess was right.

zanga · September 20, 2022, 8:16am

Thank you for the reply. I think I’ll open a new post, in case you have time to look at my problem.

ptrblck · September 20, 2022, 8:17am

Sure, you can create a new topic for your issue. If possible, post a minimal and executable code snippet which would reproduce the error, so that I could directly debug it.