Quick disclaimer: I’m very new to torchrl, but after some thorough reading of the documentation, I believe I found some errors in the DQN torchrl tutorial here
The first error I saw was the creation of an EGreedyModule (placed in exploration_module) that is never actually used. Instead it looks like they used a workaround of 5000 random steps for exploration and then applied a greedy policy. What’s strange is that they still step the EGreedyModule within the training loop, which does nothing. One would have expected them to compose the exploration_module with the policy and pass that to the collector, but instead they just passed the raw value network.
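For reference, this is the kind of wiring I would have expected, as a rough sketch from my reading of the docs (value_net, env and the numbers are placeholders, not the tutorial’s exact names):

```python
from tensordict.nn import TensorDictSequential
from torchrl.collectors import SyncDataCollector
from torchrl.modules import EGreedyModule

# epsilon-greedy wrapper that anneals epsilon as it is stepped
exploration_module = EGreedyModule(
    spec=env.action_spec, eps_init=0.9, eps_end=0.05, annealing_num_steps=100_000
)
# compose the greedy value network with the exploration wrapper
exploration_policy = TensorDictSequential(value_net, exploration_module)

# the collector gets the exploration policy, not the raw value network
collector = SyncDataCollector(
    env, exploration_policy, frames_per_batch=100, total_frames=-1
)

# inside the training loop, stepping the module then actually has an effect:
exploration_module.step(100)  # anneal epsilon by frames_per_batch steps
```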
The second error, and far more significant, is the fact that they didn’t set the DQNLoss parameter delay_value=True, which is what creates the target network (see here). I’m not sure what this means for the point where they step the SoftUpdate.
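For what it’s worth, my understanding of how those two pieces are meant to fit together is roughly this (a sketch, with value_net as a placeholder for the tutorial’s value network):

```python
from torchrl.objectives import DQNLoss, SoftUpdate

# delay_value=True makes DQNLoss keep a separate, delayed copy of the value network
loss_module = DQNLoss(value_net, delay_value=True)

# SoftUpdate then has an actual target network to move towards the online weights
target_updater = SoftUpdate(loss_module, eps=0.995)

# ... later, inside the training loop, after each optimizer step:
target_updater.step()
```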
Please let me know if the error is in my understanding, thanks.
P.S. I haven’t submitted a PR on github because I’ve tried to correct for this in my own code and haven’t done so successfully yet.
Probably not the answer that you would expect, but I can only suggest dropping torchrl and using stable baselines3. In that framework, you can develop your own implementation in PyTorch as well.
This PR fixed the doc and the way the collector is built.
RE TorchRL vs SB3: I’d say that torchrl is as robust as it gets; let’s not judge what makes a library “industry-standard” or not based on a couple of typos in the doc/tutorials.
That said, SB3 is also a great library (although with a totally different spirit than TorchRL) and I’d definitely encourage you to check it out too!
A great example that comes to mind is how you build a buffer in both libs:
In SB3 (from CleanRL):
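Something along these lines, roughly following CleanRL’s dqn.py (a sketch; envs, obs, device, etc. are placeholders for the usual CleanRL setup):

```python
from stable_baselines3.common.buffers import ReplayBuffer

# placeholder setup: `envs` is a vectorized gym env, `device` a torch device
rb = ReplayBuffer(
    buffer_size=10_000,
    observation_space=envs.single_observation_space,
    action_space=envs.single_action_space,
    device=device,
    handle_timeout_termination=False,
)

# each array's role (obs, next_obs, action, ...) is fixed by its position
rb.add(obs, next_obs, actions, rewards, dones, infos)

# sampling returns a namedtuple with a preset list of fields
data = rb.sample(128)
data.observations, data.actions, data.next_observations, data.rewards, data.dones
```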
Here, data has fields observations, actions, etc.
You see that the position of the items in add determines their “role” in the buffer.
Also, a big chunk of the buffer is based on numpy.
It’s awesome if you “just” need a buffer that does exactly what this buffer does, but if you need more refined usage (e.g. you have multiple actions such as in MARL, your split between observations and info is not as clear-cut, you want some fancy sampling…) it may not be super scalable.
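In TorchRL, a comparable buffer would look roughly like this (a sketch with placeholder sizes and keys, not the exact snippet):

```python
import torch
from tensordict import TensorDict
from torchrl.data import LazyTensorStorage, SliceSampler, TensorDictReplayBuffer

device = "cuda" if torch.cuda.is_available() else "cpu"

rb = TensorDictReplayBuffer(
    # the storage lives on `device`, so sampled data never leaves the GPU
    storage=LazyTensorStorage(100_000, device=device),
    # a sampler that returns contiguous slices of trajectories rather than i.i.d. steps
    sampler=SliceSampler(num_slices=8, traj_key="episode"),
    batch_size=64,
)

# the buffer does not impose observation/action/reward roles: any set of keys works
data = TensorDict(
    {
        "observation": torch.randn(1000, 4),
        "action": torch.randn(1000, 2),
        "episode": torch.arange(1000) // 500,  # two trajectories of 500 steps each
    },
    batch_size=[1000],
)
rb.extend(data)

sample = rb.sample()  # a TensorDict of shape [64], already on `device`
```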
You see that the buffer is more generic: we passed a storage on device (so our data stays on CUDA) and we could also ask for a sampler that has some extra features (sampling slices of trajectories, etc.).
The philosophy of both libs is very different, and it really depends on whether you want to apply an existing algo to a specific problem that fits the requirements of the implementation (in which case SB3 is a good fit) or whether you want to modify an existing implementation or do something that is a bit more off the beaten path.