Cannot find where the in-place operation error comes from

I’m trying to train a MADDPG model, but it raises an in-place operation error.
Here are the excerpts from the traceback that I think are critical:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [16, 2]], which is output 0 of AsStridedBackward0, is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

[W …\torch\csrc\autograd\python_anomaly_mode.cpp:104] Warning: Error detected in AddmmBackward0. Traceback of forward call that caused the error:

c_loss, a_loss = all_agents.learn(memory)
File "C:\Users\chhuang\new_dens_speed_offset\maddpg.py", line 107, in learn
pi = self.agents[idx - 1].actor(car_dens_state['%s' % idx], scooter_dens_state['%s' % idx],
File "C:\Users\chhuang\anaconda3_1\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\chhuang\new_dens_speed_offset\networkv4.py", line 73, in forward
xyzw = self.mu(xyzw)

The traceback points to networkv4.py, line 73.
That line, xyzw = self.mu(xyzw), is in the forward() method of the Actor class. The full code of the Actor class is below:

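import os

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


class Actor(nn.Module):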
    def __init__(self, alpha, name, chkpt_dir):
        super(Actor, self).__init__()
        
        self.chkpt_file = os.path.join(chkpt_dir, name)

        self.fc1 = nn.Linear(4, 16)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, 32)
        self.fc4 = nn.Linear(32, 16)
        
        self.bn1 = nn.LayerNorm(16)
        self.bn2 = nn.LayerNorm(64)
        self.bn3 = nn.LayerNorm(32)
        self.bn4 = nn.LayerNorm(16)
        
        self.mu = nn.Linear(16, 2)
        
        self.optimizer = optim.Adam(self.parameters(), lr=alpha)
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.to(self.device)

    def forward(self, car_D, scooter_D, car_S, scooter_S):
        torch.autograd.set_detect_anomaly(True)
        x = self.fc1(car_D)
        x = self.bn1(x)
        x = F.relu(x).flatten(1)
        
        y = self.fc1(scooter_D)
        y = self.bn1(y)
        y = F.relu(y).flatten(1)
        
        z = self.fc1(car_S)
        z = self.bn1(z)
        z = F.relu(z).flatten(1)
        
        w = self.fc1(scooter_S)
        w = self.bn1(w)
        w = F.relu(w).flatten(1)
        
        xyzw = torch.cat([x, y, z, w], dim=1)
        xyzw = self.fc2(xyzw)
        xyzw = self.bn2(xyzw)
        xyzw = F.relu(xyzw)
        
        xyzw = self.fc3(xyzw)
        xyzw = self.bn3(xyzw)
        xyzw = F.relu(xyzw)
        
        xyzw = self.fc4(xyzw)
        xyzw = self.bn4(xyzw)
        xyzw = F.relu(xyzw)
        xyzw = self.mu(xyzw)
        xyzw = torch.sigmoid(xyzw)
        return xyzw

And here’s the link to the full code of maddpg.py. It’s a bit ugly, so I uploaded it to GitHub instead of posting it here.

I’ve been stuck on this problem all day and still can’t find the bug. I hope someone can point me in the right direction to fix my code.

I guess the issue might be caused by using

critic_loss.backward(retain_graph=True)
...
actor_loss.backward(retain_graph=True)
...
self.agents[idx - 1].actor.optimizer.step()
self.agents[idx - 1].update_network_parameters()

Using retain_graph=True won’t release the computation graph, which often yields these types of errors (e.g. if the parameters were already updated and thus the forward activations are stale, as described here).
Could you explain why retain_graph=True is used?
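
To illustrate that failure mode, here is a minimal sketch. It is not the original maddpg.py, and the actor, critic, optimizer, and losses are hypothetical stand-ins chosen only to keep the example short. It assumes the pattern guessed above: one forward pass, a first backward with retain_graph=True, an optimizer step that updates parameters in place, and then a second backward through the same (now stale) graph.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the actor and critic networks in the thread.
actor = nn.Linear(4, 2)
critic = nn.Linear(2, 1)
critic_opt = torch.optim.SGD(critic.parameters(), lr=0.1)

state = torch.randn(16, 4)
action = actor(state)   # shape [16, 2]
q = critic(action)      # one graph shared by both losses

critic_loss = q.pow(2).mean()
actor_loss = -q.mean()

critic_opt.zero_grad()
critic_loss.backward(retain_graph=True)  # graph is kept alive for reuse
critic_opt.step()                        # updates critic.weight in place

error_message = ""
try:
    # Backpropagating to the actor needs the critic weight that was saved
    # during the forward pass, but that tensor has just been modified.
    actor_loss.backward()
except RuntimeError as err:
    error_message = str(err)

print(error_message)
```

In this sketch the second backward() fails because autograd detects that a tensor saved for the backward pass changed after the forward pass. Whether the same thing happens in the original code cannot be confirmed without seeing it.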

Hello @ptrblck, I know this is a little outdated but I came across the same error.
The GitHub link no longer works, so I cannot see the whole code, but if I’m guessing right, during learning @ntuce002 has to make multiple passes through the graph (one for each agent), so it needs to be retained.

Maybe I’m totally wrong, and my understanding of the mechanisms behind it may be too, but how can it work if we don’t retain the graph?

Thank you.

I don’t know enough anymore about this use case and was speculating what might be causing the issue. Unfortunately, @ntuce002 didn’t follow up so I don’t know if my guess was right.
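
For what it’s worth, if the guess is right, one common way to avoid retaining the graph is to run a fresh forward pass for each loss, so that every backward() call works on a graph built from the current parameters. The sketch below assumes a heavily simplified single-agent setup: the networks, the zero target, and the critic taking only the action as input are all placeholders, not the code from this thread. In a multi-agent loop the same idea would mean recomputing each agent’s forward pass inside the loop.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins; a real MADDPG critic also takes states as input.
actor = nn.Linear(4, 2)
critic = nn.Linear(2, 1)
actor_opt = torch.optim.SGD(actor.parameters(), lr=0.1)
critic_opt = torch.optim.SGD(critic.parameters(), lr=0.1)

state = torch.randn(16, 4)
target_q = torch.zeros(16, 1)  # placeholder target values
actor_before = actor.weight.detach().clone()

# Critic update: detach the action so this graph stops at the critic.
q = critic(actor(state).detach())
critic_loss = F.mse_loss(q, target_q)
critic_opt.zero_grad()
critic_loss.backward()   # no retain_graph needed
critic_opt.step()

# Actor update: a new forward pass through the already-updated critic.
actor_loss = -critic(actor(state)).mean()
actor_opt.zero_grad()
actor_loss.backward()    # also fills critic grads; the next zero_grad clears them
actor_opt.step()

actor_changed = not torch.equal(actor_before, actor.weight.detach())
print(actor_changed)
```

The trade-off is an extra forward pass per loss, in exchange for never backpropagating through a graph whose parameters have since been updated.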

Thank you for the reply. I think I’ll open a new post, if you have time to check my problem.

Sure, you can create a new topic for your issue. If possible, post a minimal and executable code snippet which would reproduce the error, so that I could directly debug it.
