Suspected error in DQN tutorial:

Hi everyone,

Quick disclaimer: I'm very new to torchrl, but after some thorough documentation reading, I believe I found some errors in the torchrl DQN tutorial here.

  1. The first error I saw was the creation of an egreedy module (assigned to exploration_module) that is never actually used. Instead, it looks like they used a workaround of 5000 random steps for exploration and then applied a greedy algorithm. What's weird is that they still step the egreedy module within the training loop, which does nothing. One would have expected them to place the exploration_module as the policy in the collector, but instead they just passed the raw value network (I've put a rough sketch of what I expected at the end of this post).
  2. The second error, and a far more significant one, is the fact that they didn't set the DQNLoss parameter delay_value = True, which would create the target network (see here). I'm not sure what this means for when they step the SoftUpdate.

Please let me know if I've made an error in my understanding, thanks.

P.S. I haven't submitted a PR on GitHub because I've tried to correct this in my own code and haven't managed to do so successfully yet.
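For reference, here is roughly what I would have expected for (1). This is just an untested sketch, with names mirroring the tutorial, so apologies if I'm misreading the API:

    from tensordict.nn import TensorDictModule as Mod, TensorDictSequential as Seq
    from torchrl.collectors import SyncDataCollector
    from torchrl.envs import GymEnv
    from torchrl.modules import EGreedyModule, MLP, QValueModule

    env = GymEnv("CartPole-v1")

    # greedy policy: the MLP writes "action_value", QValueModule picks the argmax action
    value_mlp = MLP(out_features=env.action_spec.shape[-1], num_cells=[64, 64])
    value_net = Mod(value_mlp, in_keys=["observation"], out_keys=["action_value"])
    policy = Seq(value_net, QValueModule(spec=env.action_spec))

    # epsilon-greedy exploration wrapped around the greedy policy
    exploration_module = EGreedyModule(
        env.action_spec, annealing_num_steps=100_000, eps_init=0.5
    )
    policy_explore = Seq(policy, exploration_module)

    collector = SyncDataCollector(
        env,
        policy_explore,  # what I expected here, instead of the raw value network
        frames_per_batch=100,
        total_frames=-1,
        init_random_frames=5_000,  # the 5000 random steps mentioned above
    )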

Thanks for reporting!

I agree with (1) and we should use policy_explore in the collector.

For (2) it's just a documentation problem: we changed the default and forgot to update the docstrings!
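Concretely for (2): since delay_value now defaults to True, the loss module does hold target parameters, so the SoftUpdate step is meaningful. A quick sketch (reusing the tutorial's policy and env names, not the exact tutorial code):

    from torchrl.objectives import DQNLoss, SoftUpdate

    # delay_value defaults to True, so target params are created inside the loss module;
    # passing it explicitly just makes the intent obvious
    loss = DQNLoss(value_network=policy, action_space=env.action_spec, delay_value=True)
    updater = SoftUpdate(loss, eps=0.99)

    # in the training loop, after each optimizer step:
    updater.step()  # polyak-averages the target params towards the trained params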

I will create issues in torchrl to track these.


Probably not the answer that you would expect, but I can only suggest dropping torchrl and using Stable Baselines3. In that framework, you can develop your own implementation in PyTorch as well.


Thanks for the tip, is Stable Baselines3 more industry-standard?

This PR fixed the doc and the way the collector is built.

RE TorchRL vs SB3: I'd say that TorchRL is as robust as it gets; let's not overstate what makes a library "industry-standard" or not based on a couple of typos in the docs/tutorials.

That said, SB3 is also a great library (although with a totally different spirit from TorchRL) and I'd definitely encourage you to check it out too!

A great example that comes to mind is how you build a buffer in both libs:
In SB3 (from CleanRL):

    from stable_baselines3.common.buffers import ReplayBuffer

    rb = ReplayBuffer(
        args.buffer_size,
        envs.single_observation_space,
        envs.single_action_space,
        device,
        handle_timeout_termination=False,
    )
    [...]  # later
    rb.add(obs, real_next_obs, actions, rewards, terminations, infos)
    [...]
    data = rb.sample(args.batch_size)
    data.observations

Here, data has fields observations, actions etc.
You see that the position of the items in add determines their "role" in the buffer.
Also, a big chunk of the buffer is based on numpy.
It's awesome if you "just" need a buffer that does exactly what this buffer does, but if you need more refined usage (e.g., you have multiple actions such as in MARL, your split between observations and info is not as clear-cut, you want some fancy sampling…), it may not be super scalable.

TorchRL’s buffer is more generic and modular:

import torch
from tensordict import TensorDict
from torchrl.data import LazyTensorStorage, ReplayBuffer

rb = ReplayBuffer(storage=LazyTensorStorage(args.buffer_size, device=device))
[...]
transition = TensorDict(
    observations=obs,
    next_observations=real_next_obs,
    actions=torch.as_tensor(actions, device=device, dtype=torch.float),
    rewards=torch.as_tensor(rewards, device=device, dtype=torch.float),
    terminations=terminations,
    dones=terminations,
    batch_size=obs.shape[0],
    device=device,
)
rb.extend(transition)
data = rb.sample(batch_size)
data["observations"]

You see that the buffer is more generic: we passed a storage on device (so our data stays on CUDA), and we could also ask for a sampler with some extra features (sampling slices of trajectories, etc.).
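For instance, here is a quick sketch (not from the tutorial) of the same buffer with a slice sampler, assuming the stored transitions carry an "episode" key identifying trajectories:

    from torchrl.data import LazyTensorStorage, ReplayBuffer
    from torchrl.data.replay_buffers import SliceSampler

    rb = ReplayBuffer(
        storage=LazyTensorStorage(args.buffer_size, device=device),
        # sample contiguous sub-trajectories instead of i.i.d. transitions
        sampler=SliceSampler(num_slices=8, traj_key="episode"),
    )
    data = rb.sample(batch_size)  # batch_size frames split across 8 trajectory slices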

The philosophy of the two libs is very different, and it really depends on whether you want to apply an existing algo to a specific problem that fits the requirements of the implementation (in which case SB3 is a good fit), or whether you want to modify an existing implementation or do something that is a bit more off the beaten path.