Code for dealing with vanishing gradients in RNNs?

Is there example code somewhere showing how best to deal with the vanishing gradient problem, as outlined in Pascanu et al. 2013?
I have to admit I do not understand how to turn their formula 10 into PyTorch code.
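If I am reading the paper correctly, the term in question is the regularizer

Ω = Σ_k ( ||(∂E/∂x_{k+1}) (∂x_{k+1}/∂x_k)|| / ||∂E/∂x_{k+1}|| − 1 )²

where x_k is the hidden state at step k, i.e. it penalizes the norm of the backpropagated gradient shrinking from step k+1 to step k. What I cannot figure out is how to get those per-step gradients out of autograd.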
BTW, are there any empirical insights into whether vanishing or exploding gradients happen more often when using a standard LSTM architecture for sequence classification or tagging?
Gradient clipping appears to be fairly straightforward, even in LSTMs, but I am not at all sure about the regularization term for preventing vanishing gradients.
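To make it concrete, here is a rough sketch of what I have in mind: clipping with torch.nn.utils.clip_grad_norm_, plus a naive attempt at the Ω term using torch.autograd.grad on a hand-unrolled nn.RNNCell so the per-step hidden states are accessible. Everything below (the TinyRNN model, the omega_regularizer helper, the eps, the weight of 1.0 on the regularizer) is my own guesswork rather than anything from the paper, and I simply let autograd differentiate through everything, whereas the paper (as far as I understand it) treats ∂E/∂x_{k+1} as a constant.

```python
import torch
import torch.nn as nn

class TinyRNN(nn.Module):
    """Hand-unrolled RNN so the per-step hidden states stay visible as graph nodes."""
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.cell = nn.RNNCell(input_size, hidden_size)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, x):                        # x: (seq_len, batch, input_size)
        h = x.new_zeros(x.size(1), self.cell.hidden_size)
        states = []
        for t in range(x.size(0)):
            h = self.cell(x[t], h)
            states.append(h)
        return self.head(h), states


def omega_regularizer(loss, states, eps=1e-8):
    # dE/dh_k for every hidden state; create_graph so Omega itself can be backpropagated
    grads = torch.autograd.grad(loss, states, create_graph=True, retain_graph=True)
    omega = 0.0
    for k in range(len(states) - 1):
        # vector-Jacobian product: (dE/dh_{k+1}) * (dh_{k+1}/dh_k)
        vjp = torch.autograd.grad(states[k + 1], states[k],
                                  grad_outputs=grads[k + 1],
                                  create_graph=True, retain_graph=True)[0]
        ratio = vjp.norm(dim=1) / (grads[k + 1].norm(dim=1) + eps)
        omega = omega + ((ratio - 1.0) ** 2).mean()   # mean over the batch is my own choice
    return omega / (len(states) - 1)


model = TinyRNN(input_size=16, hidden_size=32, num_classes=5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

x = torch.randn(20, 8, 16)                       # (seq_len, batch, input_size), dummy data
y = torch.randint(0, 5, (8,))

logits, states = model(x)
loss = criterion(logits, y)
loss = loss + 1.0 * omega_regularizer(loss, states)   # regularizer weight 1.0 is a guess

optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # the "easy" clipping part
optimizer.step()
```

Does something along these lines look right, or is there a less clumsy way to get at the per-step gradients?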
