How to remove zero padding when splitting a collector trajectory in the PPO tutorial?

In the PPO tutorial, split_trajs is set to False for the SyncDataCollector. However, I want to split the collected data into trajectories and learn from them. If I set this argument to True, the data is returned split by trajectory, but it is zero-padded. I want to remove this zero padding from the training data.

collector = SyncDataCollector(
    env,
    policy_module,
    frames_per_batch=frames_per_batch,
    total_frames=total_frames,
    split_trajs=True,
    device=device,
)
# ...


for i, tensordict_data in enumerate(collector):
    for td_id, td_trajectory in enumerate(tensordict_data):
        mask = td_trajectory["collector", "mask"]
        # Now I want to erase the zero padding that each tensor in the trajectory tensordict has based on the mask (each tensor has a different size and dimension)

    for _ in range(num_epochs):
        # We'll need an "advantage" signal to make PPO work.
        # We re-compute it at each epoch as its value depends on the value
        # network which is updated in the inner loop.
        advantage_module(tensordict_data)

The masks are in tensordict["collector", "mask"], but I don’t know how to apply them to the entire tensordict to remove the zero padding everywhere. Each tensor in the tensordict has a different shape, so simply applying torch.masked_select raises an error, and I feel that a straightforward implementation would be very cumbersome. Any ideas would be appreciated.

Here is an example of the TensorDict structure in my environment:

TensorDict(
    fields={
        action: Tensor(shape=torch.Size([14, 1]), device=cuda:0, dtype=torch.float32, is_shared=True),
        collector: TensorDict(
            fields={
                mask: Tensor(shape=torch.Size([14]), device=cuda:0, dtype=torch.bool, is_shared=True),
                traj_ids: Tensor(shape=torch.Size([14]), device=cuda:0, dtype=torch.int64, is_shared=True)},
            batch_size=torch.Size([14]),
            device=cuda:0,
            is_shared=True),
        done: Tensor(shape=torch.Size([14, 1]), device=cuda:0, dtype=torch.bool, is_shared=True),
        loc: Tensor(shape=torch.Size([14, 1]), device=cuda:0, dtype=torch.float32, is_shared=True),
        next: TensorDict(
            fields={
                done: Tensor(shape=torch.Size([14, 1]), device=cuda:0, dtype=torch.bool, is_shared=True),
                observation: Tensor(shape=torch.Size([14, 11]), device=cuda:0, dtype=torch.float32, is_shared=True),
                pixels: Tensor(shape=torch.Size([14, 3, 28, 28]), device=cuda:0, dtype=torch.float32, is_shared=True),
                reward: Tensor(shape=torch.Size([14, 1]), device=cuda:0, dtype=torch.float32, is_shared=True),
                step_count: Tensor(shape=torch.Size([14, 1]), device=cuda:0, dtype=torch.int64, is_shared=True),
                terminated: Tensor(shape=torch.Size([14, 1]), device=cuda:0, dtype=torch.bool, is_shared=True),
                truncated: Tensor(shape=torch.Size([14, 1]), device=cuda:0, dtype=torch.bool, is_shared=True)},
            batch_size=torch.Size([14]),
            device=cuda:0,
            is_shared=True),
        observation: Tensor(shape=torch.Size([14, 11]), device=cuda:0, dtype=torch.float32, is_shared=True),
        pixels: Tensor(shape=torch.Size([14, 3, 28, 28]), device=cuda:0, dtype=torch.float32, is_shared=True),
        sample_log_prob: Tensor(shape=torch.Size([14]), device=cuda:0, dtype=torch.float32, is_shared=True),
        scale: Tensor(shape=torch.Size([14, 1]), device=cuda:0, dtype=torch.float32, is_shared=True),
        step_count: Tensor(shape=torch.Size([14, 1]), device=cuda:0, dtype=torch.int64, is_shared=True),
        terminated: Tensor(shape=torch.Size([14, 1]), device=cuda:0, dtype=torch.bool, is_shared=True),
        truncated: Tensor(shape=torch.Size([14, 1]), device=cuda:0, dtype=torch.bool, is_shared=True)},
    batch_size=torch.Size([14]),
    device=cuda:0,
    is_shared=True)

Thank you

If you want to “unsplit” (remove the padding) you can just do

tensordict[tensordict["collector", "mask"]]

that should do what you want. The only condition is that the mask is right-expandable to the shape of the tensordict, which is the case here.
LMK if I misunderstood what you want to do.
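
To illustrate why a single index handles entries with different trailing shapes, here is a minimal sketch that uses plain torch tensors in a regular dict as a stand-in for the TensorDict. The key names and sizes are made up for illustration. A boolean mask of shape [T] indexes only the leading dimension of each entry, so whatever comes after that dimension is preserved.

```python
import torch

# Stand-in for one padded trajectory of length 5 whose last 2 steps are padding.
mask = torch.tensor([True, True, True, False, False])
padded = {
    "observation": torch.zeros(5, 11),
    "pixels": torch.zeros(5, 3, 28, 28),
    "reward": torch.zeros(5, 1),
}

# Boolean indexing acts on the leading dimension only, so every entry
# keeps its trailing shape and loses the padded steps.
unpadded = {key: value[mask] for key, value in padded.items()}

for key, value in unpadded.items():
    print(key, tuple(value.shape))
```

Indexing a TensorDict with the mask should behave like this dict comprehension applied to every entry, including nested ones, as long as the mask matches the leading batch dimensions.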

Thanks for giving me the idea.

What I would like to do is keep the data split by trajectory, but remove the zero padding from each trajectory so that only valid, non-padded data is used for learning.

I thought this could be done using the ideas you mentioned, as follows:

    for i, tensordict_data in enumerate(collector):
        for td_id, td_traj in enumerate(tensordict_data):
            print(td_traj["next", "reward"].shape)  # padded length
            td_traj = td_traj[td_traj["collector", "mask"]]
            print(td_traj["next", "reward"].shape)  # unpadded length

            # Experimental change: this assignment raises a shape error
            tensordict_data["next", "reward"][td_id] = td_traj.get(("next", "reward"))

        # we now have a batch of data to work with. Let's learn something from it.
        for _ in range(num_epochs):
            # We'll need an "advantage" signal to make PPO work.
            # We re-compute it at each epoch as its value depends on the value network which is updated in the inner loop.
            advantage_module(tensordict_data)
            data_view = tensordict_data.reshape(-1)
# ...

However, as the code above shows, an unpadded trajectory cannot be assigned back into tensordict_data, because its shape differs from that of the padded trajectory.

I also tried setting only split_trajs=True, without any other change, and found that even this produces NaN errors. Does what I am trying to do break PPO?

Here is the error:

/home///venv/lib/python3.10/site-packages/torch/autograd/__init__.py:251: UserWarning: Error detected in ExpBackward0. Traceback of forward call that caused the error:
  File "/home///proximal-policy-optimization-for-inverteddoublependulum.py", line 404, in <module>
    main()
  File "/home///proximal-policy-optimization-for-inverteddoublependulum.py", line 38, in main
    train(seed)
  File "/home///proximal-policy-optimization-for-inverteddoublependulum.py", line 300, in train
    loss_vals = loss_module(subdata.to(device))
  File "/home///venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home///venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home///venv/lib/python3.10/site-packages/tensordict/_contextlib.py", line 126, in decorate_context
    return func(*args, **kwargs)
  File "/home///venv/lib/python3.10/site-packages/tensordict/nn/common.py", line 282, in wrapper
    return func(_self, tensordict, *args, **kwargs)
  File "/home///venv/lib/python3.10/site-packages/torchrl/objectives/ppo.py", line 655, in forward
    gain2 = log_weight_clip.exp() * advantage
 (Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:114.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
  File "/home///proximal-policy-optimization-for-inverteddoublependulum.py", line 404, in <module>
    main()
  File "/home///proximal-policy-optimization-for-inverteddoublependulum.py", line 38, in main
    train(seed)
  File "/home///proximal-policy-optimization-for-inverteddoublependulum.py", line 318, in train
    loss_value.backward()
  File "/home///venv/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home///venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'ExpBackward0' returned nan values in its 0th output.

If you remove the zeros, will you also remove the corresponding valid non-padded data?
Say you have 2 trajs, one of length 3 and the other 5

[[True, True, True, False, False],
[True, True, True, True, True]]

If you want to keep the first dim as 2, you have the following choices:

  • Use split_trajectories as it is and keep the padding
  • Use split_trajectories with as_nested=True. This will give you a bunch of nested tensors which have no padding but limited usage (you can’t do every op with them). I’m working on this here
  • Truncate at the first padded value (i.e. to the length of the shortest trajectory) and get
[[True, True, True],
[True, True, True]]
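
The first and third options can be sketched with plain torch tensors. The reward values below are made up, and the variable names are only for illustration. The sketch shows that keeping the padding requires masking any reduction, while truncation discards valid steps of the longer trajectory.

```python
import torch

# Mask from the example above: trajectory 0 has 3 valid steps, trajectory 1 has 5.
mask = torch.tensor([
    [True, True, True, False, False],
    [True, True, True, True, True],
])
# Made-up rewards, zeroed where the mask is False to mimic zero padding.
reward = torch.arange(1.0, 11.0).reshape(2, 5) * mask

# Option 1: keep the padding, but reduce over valid steps only.
naive_mean = reward.mean()         # padded zeros drag the mean down
masked_mean = reward[mask].mean()  # averages the 8 valid steps only

# Option 3: truncate every trajectory to the shortest valid length.
min_len = int(mask.sum(dim=1).min())
truncated = reward[:, :min_len]    # shape [2, 3]; drops 2 valid steps of trajectory 1

print(naive_mean.item(), masked_mean.item(), tuple(truncated.shape))
```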

Can you explain what you want to achieve?

I had never heard of as_nested before. Thanks for the deep insight.

Let me list what I would like to do:

  1. To experiment with the behaviour of PPO, I would like to split the collected data into trajectories and add an extra term to td_traj["next", "reward"] for each trajectory, as shown below. For this I have added split_trajs=True, as in the full code linked further down.
    for i, tensordict_data in enumerate(collector):
        for td_id, td_traj in enumerate(tensordict_data):
            # Experimental change to the reward (this is where the shape errors occur)
            tensordict_data["next", "reward"][td_id] = td_traj.get(("next", "reward")) + [some custom terms]

        # we now have a batch of data to work with. Let's learn something from it.
        for _ in range(num_epochs):
            # We'll need an "advantage" signal to make PPO work.
            # We re-compute it at each epoch as its value depends on the value network which is updated in the inner loop.
            advantage_module(tensordict_data)
            data_view = tensordict_data.reshape(-1)
  2. When I ran this, I got a NaN error. I want to resolve the NaN error that occurs in the PPO code when split_trajs is set to True. This is the full code (external site): Untitled (mahmdrpi) - PasteCode.io

  3. My guess is that the NaN error is caused by the zero padding inserted by split_trajs, which could be acting as noise during learning. So I want to get rid of the zero padding, which is the question I posted here.


If you remove the zeros, will you also remove the corresponding valid non-padded data?

I would avoid truncating the padding to the shortest trajectory, as this could also be detrimental to learning.

I also suspect that keeping the padding as it is may be bad for PPO training, and may be what triggers the NaN anomaly error I described above.

So I still don’t really understand what you want, if it’s not padded, not a flat list of transitions and not a truncated set of trajectories either. It must be one of these, no? Could you give a sketch of what the data would look like?

If you want to apply some transform you can do

mask = data["collector", "mask"]
data_masked = data[mask]
transformed_data = make_something_to_my_data(data_masked)
data[mask] = transformed_data
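
As a minimal sketch of this read, transform and write-back pattern, here it is on a single padded tensor instead of a full tensordict. The reward tensor and the constant bonus are assumptions made up for illustration.

```python
import torch

mask = torch.tensor([
    [True, True, True, False, False],
    [True, True, True, True, True],
])
# Made-up padded rewards of shape [2, 5, 1]: ones on valid steps, zeros on padding.
reward = torch.ones(2, 5, 1) * mask.unsqueeze(-1)

valid = reward[mask]    # flat copy of the 8 valid steps, shape [8, 1]
valid = valid + 0.5     # hypothetical transform (a constant bonus)
reward[mask] = valid    # write back; padded entries are left untouched

print(tuple(reward.shape))
```

The padded layout is preserved, so the result can still go through code that expects the [trajectory, time] batch shape. Note that the transform sees all valid steps as one flat batch, without trajectory boundaries.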

Another option is to use as_nested, which is available on nightlies (pip install git+https://github.com/pytorch/rl):

import torch
from torchrl.collectors.utils import split_trajectories

data = split_trajectories(data, as_nested=True)
traj_tuple = data.unbind(0) # unbinds the data along dim 0, results in non-nested tensors
# do_smth is a placeholder for your own per-trajectory transform
traj_tuple = tuple(do_smth(traj) for traj in traj_tuple)
data = traj_tuple[0].apply(lambda *elts: torch.nested.nested_tensor(list(elts)), *traj_tuple[1:], batch_size=data.batch_size)

which will do your transform on each traj independently, without padding.
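
If nested tensors are not an option, a similar per-trajectory effect can probably be had while keeping the padded layout, by looping over trajectories and touching only the valid steps of each one. The following is a sketch on a plain torch tensor rather than a tensordict. The rewards and the center function are hypothetical, chosen so that the result depends on trajectory boundaries.

```python
import torch

mask = torch.tensor([
    [True, True, True, False, False],
    [True, True, True, True, True],
])
# Made-up zero-padded rewards of shape [2, 5, 1].
reward = (torch.arange(1.0, 11.0).reshape(2, 5) * mask).unsqueeze(-1)


def center(traj_reward):
    # Hypothetical per-trajectory transform: subtract that trajectory's mean reward.
    return traj_reward - traj_reward.mean()


for i in range(reward.shape[0]):
    row = reward[i]                      # view into trajectory i, shape [5, 1]
    row[mask[i]] = center(row[mask[i]])  # only valid steps are read and written

print(reward.squeeze(-1))
```

Each trajectory is transformed using only its own valid steps, and the padding stays at zero, so the batch shape expected by the rest of the training loop is unchanged.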