Gradient Rescaling in Backpropagation

Dueling DQN, a simple improvement over DQN, splits the fully connected layers into two streams (a value stream and an advantage stream). To quote from the paper:
“we rescale the combined gradient entering the last convolutional layer by 1/√2. This simple heuristic mildly increases stability”

How do we achieve this in PyTorch? I noticed that most of the popular PyTorch-based RL frameworks (Tianshou, TorchRL, …) do not implement this rescaling. Is there a reason behind that?
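
For reference, here is a minimal sketch of how I imagine it could be done, using a backward hook on the shared convolutional features (the layer sizes are just the usual Atari DQN setup, purely for illustration, not taken from the paper or from any framework):

```python
import math

import torch
import torch.nn as nn


class DuelingQNetwork(nn.Module):
    """Toy dueling architecture: a shared conv trunk feeding a value stream
    and an advantage stream. Sizes are the standard Atari ones, for illustration."""

    def __init__(self, n_actions: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.value = nn.Sequential(nn.Linear(64 * 7 * 7, 512), nn.ReLU(), nn.Linear(512, 1))
        self.advantage = nn.Sequential(nn.Linear(64 * 7 * 7, 512), nn.ReLU(), nn.Linear(512, n_actions))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = self.conv(x)
        if features.requires_grad:
            # Scale the combined gradient flowing back from the two streams
            # into the last convolutional layer by 1/sqrt(2).
            features.register_hook(lambda grad: grad * (1.0 / math.sqrt(2.0)))
        v = self.value(features)
        a = self.advantage(features)
        # Standard dueling aggregation: Q = V + (A - mean(A)).
        return v + a - a.mean(dim=1, keepdim=True)
```

If I understand the paper correctly, the hook multiplies the gradient arriving at the conv output (i.e. the sum of the value-stream and advantage-stream gradients) by 1/√2, so every conv layer upstream sees the rescaled gradient while the two fully connected streams are untouched. Is that the right interpretation?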

Thanks in advance

Hi @seer_mer
I personally did not test it in my TorchRL implementation. My understanding is that, because the convolutional trunk feeds two networks, their gradients add up at that layer, and the rescaling brings the combined gradient back to roughly the scale it would have if there were only one.
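As a toy illustration of the "feeds two networks" point (nothing specific to TorchRL or to the paper's exact setup): when a tensor is read by two branches, autograd sums the gradients coming back from each of them, and that sum is the combined gradient the paper then scales down.

```python
import torch

# A tensor shared by two "streams" receives the sum of their gradients.
z = torch.ones(3, requires_grad=True)
loss = (2 * z).sum() + (3 * z).sum()  # two branches reading the same tensor
loss.backward()
print(z.grad)  # tensor([5., 5., 5.]): the two contributions (2 and 3) add up
```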
It feels a bit ad hoc and is not well documented, and I don't see any tests or a precise mathematical justification for this choice. If we were to adopt it in TorchRL, we would have to justify it just by quoting the paper, without being able to put a precise rationale behind it. It is also the kind of thing that surprises users who want to play around with the algorithm (what should I do with this line? Is it crucial?). Finally, it would be surprising if the returns really depended on that trick.
Overall, I think our role as developers of this kind of library is less to reproduce a paper's results exactly than to provide the tools for you to build upon those results. In this case, I think we all judge that this trick does not constitute the core contribution of the paper.
Hope that answers the question.

Thank you, that helped me understand the reasoning behind this design choice.