I’m trying to understand how the Adam optimizer is implemented in PyTorch.
Basically, I would like to penalize the returned loss with the L2
norm of some noise variable (for use in a specific problem).
The way the loss is handled in the implementation is not as straightforward as the paper suggests. What’s the advisable way to do this?
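In case it helps frame the question: one common approach is to add the penalty term to the loss yourself before calling `backward()`, rather than modifying the optimizer. Below is a minimal sketch under assumed details, since the post doesn't say how the noise variable enters the model: here `noise` is a learnable tensor added to the input, `lam` is an assumed penalty weight, and the model/data are placeholders.

```python
import torch
import torch.nn.functional as F

# Hypothetical setup: a small model and a learnable noise variable.
model = torch.nn.Linear(10, 1)
noise = torch.randn(10, requires_grad=True)  # the "noise" variable (assumed learnable)

# Adam optimizes both the model parameters and the noise tensor.
optimizer = torch.optim.Adam(list(model.parameters()) + [noise], lr=1e-3)

# Placeholder data.
x = torch.randn(32, 10)
y = torch.randn(32, 1)

optimizer.zero_grad()
pred = model(x + noise)               # assumption: noise perturbs the input
base_loss = F.mse_loss(pred, y)

lam = 0.1                             # assumed penalty weight
loss = base_loss + lam * noise.pow(2).sum()  # L2 penalty on the noise variable

loss.backward()                       # gradients flow into both model and noise
optimizer.step()
```

Since the penalty is just another term in the computation graph, autograd handles it automatically and Adam updates `noise` to trade off the task loss against the penalty; no changes to the optimizer itself are needed.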