I am working on some multi-agent RL training using PPO.
As part of that, I need to calculate the advantage on a per-agent basis, which means I take the data generated by playing the game and mask out one agent's portion of it at a time.
This has led to an in-place error that's killing the gradient, and the stack trace from PyTorch's anomaly detection (torch.autograd.set_detect_anomaly(True)) points at the value function output from my NN.
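By anomaly detection I just mean the standard PyTorch switch, flipped on once before training starts (a minimal sketch, not my exact setup):

import torch

# makes autograd record forward-pass stack traces, so the in-place error
# reports the op that produced the problematic tensor
torch.autograd.set_detect_anomaly(True)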
Here’s a gist of the relevant code with the learning code separated out: cleanRL · GitHub
I found this post where they were getting a similar error and fixed it by saving intermediate results into a list and then turning that list of tensors into the complete tensor at the end. However, that solution is not working for me.
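For clarity, this is the pattern I mean by "save into a list, then build the complete tensor at the end" (a minimal standalone sketch with made-up names, not my actual code):

import torch

pieces = []
for t in range(5):
    pieces.append(torch.randn(1))  # each step produces its own small tensor

# the full tensor only exists once everything has been collected,
# so no previously recorded tensor gets written into in place
full = torch.cat(pieces)  # shape (5,)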
Here’s (what I think is) the offending code:
with torch.no_grad():
    sort_list = []
    advantages = []
    returns = []
    indices = torch.arange(0, rewards.shape[0]).long().to(device)
    next_value = learner.get_value(torch.FloatTensor(next_obs).unsqueeze(0).to(device))
    for player in ['player_0', 'player_1']:
        # pick out only this player's transitions from the interleaved buffer
        mask = np.array(rollouts.agent_id) == player
        masked_inds = list(indices[mask])
        lastgaelam = 0
        for t in reversed(range(mask.sum())):
            if t == mask.sum() - 1:
                nextnonterminal = 1.0 - next_done
                nextvalues = next_value
            else:
                nextnonterminal = 1.0 - dones[mask][t + 1]
                nextvalues = values[mask][t + 1]
            delta = rewards[mask][t] + gamma * nextvalues * nextnonterminal - values[mask][t]
            lastgaelam = delta + gamma * gae_lambda * nextnonterminal * lastgaelam
            advantages.append(lastgaelam)
            # record the original buffer index for this step
            sort_list.append(masked_inds[t])
            returns.append(lastgaelam + values[mask][t])
    # assemble the per-step results into full tensors only at the end
    advantages = torch.cat(advantages)[torch.LongTensor(sort_list)].to(device)
    returns = torch.cat(returns)[torch.LongTensor(sort_list)].to(device)
Here’s how the code looks for single-agent learning, where there’s still indexing and assignment into the advantages tensor:
with torch.no_grad():
    next_value = agent.get_value(next_obs).reshape(1, -1)
    advantages = torch.zeros_like(rewards).to(device)
    lastgaelam = 0
    for t in reversed(range(args.num_steps)):
        if t == args.num_steps - 1:
            nextnonterminal = 1.0 - next_done
            nextvalues = next_value
        else:
            nextnonterminal = 1.0 - dones[t + 1]
            nextvalues = values[t + 1]
        delta = rewards[t] + args.gamma * nextvalues * nextnonterminal - values[t]
        advantages[t] = lastgaelam = delta + args.gamma * args.gae_lambda * nextnonterminal * lastgaelam
    returns = advantages + values
I guess it’s the masking that is the issue, but I don’t see how else to compute what I need without it.
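To sanity-check my own assumption about the masking, here's a tiny standalone experiment (toy tensors, not my code) showing that, as far as I understand it, boolean indexing returns a copy rather than a view:

import torch

values = torch.randn(6, 1)
mask = torch.tensor([True, False, True, True, False, True])

masked = values[mask]                            # boolean (advanced) indexing
print(masked.data_ptr() == values.data_ptr())    # False: new storage, not a view
masked[0] = 123.0                                # writing into the result...
print(values[0])                                 # ...leaves the original untouched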
Here’s an example of multi-agent PPO where all the agents move simultaneously, which means you can vectorize the advantage calculation along the agent dimension of the data buffers: pettingzoo.farama.org/tutorials/cleanrl/implementing_PPO/
but that doesn’t work for extensive-form games, where the data is naturally interleaved, hence the masking approach to calculate things per agent independently (sketched below).
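To make the interleaving concrete, here is a toy version of what the buffer looks like and how a mask pulls out one agent's trajectory (names like agent_ids are made up; rollouts.agent_id plays that role in my real code):

import numpy as np
import torch

# in a turn-based (extensive-form) game the buffer alternates between agents,
# so there is no clean agent dimension to vectorize over
agent_ids = np.array(['player_0', 'player_1', 'player_0', 'player_1', 'player_0'])
rewards = torch.tensor([0.0, 0.0, 0.0, 0.0, 1.0])

for player in ['player_0', 'player_1']:
    mask = agent_ids == player                     # boolean mask over the interleaved buffer
    player_rewards = rewards[torch.from_numpy(mask)]
    print(player, player_rewards)                  # this agent's steps, pulled out of the shared buffer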
Full error if anyone wants it:
/home/roque/miniconda3/envs/mapo/lib/python3.9/site-packages/torch/autograd/__init__.py:197: UserWarning: Error detected in AddmmBackward0. Traceback of forward call that caused the error:
  File "/mnt/d/PycharmProjects/ubc/mapo/c4_train.py", line 264, in <module>
    next_done, batch_reward_main, batch_reward_opp, batch_opponents_scores) = generate_data(
  File "/mnt/d/PycharmProjects/ubc/mapo/c4/ppo.py", line 55, in generate_data
    rollout_results = rollout(env, learner, opponent, f"main_v{opponent_id}", device)
  File "/mnt/d/PycharmProjects/ubc/mapo/c4/utils.py", line 138, in rollout
    action, logprob, entropy, value, logits = main_policy.get_action_and_value(observation)
  File "/mnt/d/PycharmProjects/ubc/mapo/c4_train.py", line 184, in get_action_and_value
    return action, probs.log_prob(action), probs.entropy(), self.critic(x), logits.cpu().detach()
  File "/home/roque/miniconda3/envs/mapo/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/roque/miniconda3/envs/mapo/lib/python3.9/site-packages/torch/nn/modules/container.py", line 204, in forward
    input = module(input)
  File "/home/roque/miniconda3/envs/mapo/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/roque/miniconda3/envs/mapo/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/roque/miniconda3/envs/mapo/lib/python3.9/site-packages/torch/fx/traceback.py", line 57, in format_stack
    return traceback.format_stack()
 (Triggered internally at /opt/conda/conda-bld/pytorch_1670525551200/work/torch/csrc/autograd/python_anomaly_mode.cpp:114.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
  File "/mnt/d/PycharmProjects/ubc/mapo/c4_train.py", line 279, in <module>
    loss, entropy_loss, pg_loss, v_loss, explained_var, approx_kl, meanclipfracs, old_approx_kl = update_model(
  File "/mnt/d/PycharmProjects/ubc/mapo/c4/ppo.py", line 248, in update_model
    loss.backward(retain_graph=True)
  File "/home/roque/miniconda3/envs/mapo/lib/python3.9/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/roque/miniconda3/envs/mapo/lib/python3.9/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [256, 1]], which is output 0 of AsStridedBackward0, is at version 4; expected version 3 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!