Using the TorchRL framework and a collector, what should the output of a rollout be for `traj_ids`, `done`, and `next.done` when an episode reaches a terminal state?
Considering a 6-step rollout, I assume it should look as follows:
```python
TensorDict({
    "obs": torch.Tensor(...),     # the observation at time t
    "action": torch.Tensor(...),  # the action at time t, based on the observation at time t
    "done": torch.Tensor(...),    # done state for "obs" at time t - should almost always be False (unless done at reset)
    "next": TensorDict({
        "obs": torch.Tensor(...),     # the observation at time t+1, resulting from (obs, action) at time t
        "reward": torch.Tensor(...),  # the reward at time t+1, resulting from (obs, action) at time t
        "done": torch.Tensor(...),    # the done at time t+1, resulting from (obs, action) at time t - can be True when the transition is terminal
    }, [...]),
}, [...])
```
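For context, here is roughly the setup I have in mind, as a minimal sketch (assuming Gym's `CartPole-v1` is available and a plain `SyncDataCollector` with the default random policy; the `("collector", "traj_ids")` key is where I understand the collector stores trajectory ids):

```python
from torchrl.envs.libs.gym import GymEnv
from torchrl.collectors import SyncDataCollector

env = GymEnv("CartPole-v1")
collector = SyncDataCollector(
    env,
    policy=None,          # None falls back to a random policy
    frames_per_batch=6,   # the 6-step rollout discussed above
    total_frames=6,
)

for data in collector:
    # traj_ids is added by the collector under the "collector" sub-tensordict;
    # I expect it to increment whenever an episode terminates and the env resets
    print(data["collector", "traj_ids"])
    # root done: the flag attached to the observation at time t
    print(data["done"].squeeze(-1))
    # next.done: the flag resulting from (obs, action) at time t
    print(data["next", "done"].squeeze(-1))

collector.shutdown()
```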
Why a "done" entry at the root? Just to catch the very rare case where `reset()` returns a state that is already done.
The decision to make reward and done belong to time t+1 and the action to time t is based on the famous agent-environment interaction illustration from Sutton and Barto.
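To make the t / t+1 convention concrete, here is a small self-contained check I would expect to pass (again a sketch assuming `CartPole-v1`, where the observation key is "observation"): within a single trajectory, the observation under "next" at step t should equal the root observation at step t+1, and the root done flags should all stay False.

```python
import torch
from torchrl.envs.libs.gym import GymEnv

env = GymEnv("CartPole-v1")
rollout = env.rollout(6)  # stops early if a terminal state is reached

# next.obs at step t should be the root obs at step t + 1
assert torch.allclose(
    rollout["next", "observation"][:-1],
    rollout["observation"][1:],
)
# root done should be all False in a rollout that starts from reset()
assert not rollout["done"].any()
```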