Proper way to do gradient clipping?

You can safely modify Variable.grad.data in-place after the backward pass finishes. For example, see how it's done in the language modelling example.
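
A minimal sketch of that pattern, with a hypothetical model, optimizer, and clipping threshold just to make it self-contained:

```python
import torch
import torch.nn as nn

# Hypothetical tiny setup for illustration only.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
clip_value = 0.25  # assumed threshold

inputs = torch.randn(16, 10)
targets = torch.randn(16, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(inputs), targets)
loss.backward()

# Clip each gradient in place, after backward() and before the optimizer step.
for p in model.parameters():
    if p.grad is not None:
        p.grad.data.clamp_(-clip_value, clip_value)

optimizer.step()
```

If you want norm-based rather than value-based clipping, `torch.nn.utils.clip_grad_norm_` does the same in-place modification over an iterable of parameters.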

The reason for that design is that it gives a nicer user-facing API, with both weight tensors exposed. It also opens up the possibility of doing a batched matrix multiply on the inputs for all steps at once, and then only applying the hidden-to-hidden weights inside the loop (that optimization isn't added there yet). If you measure the overhead and show us that it can be implemented in a clean and fast way, we'll happily accept a PR or change it.
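
A rough sketch of that batching idea (not PyTorch's actual RNN implementation; the names `w_ih`, `w_hh`, the biases, and the shapes are assumptions for illustration): project the inputs for every timestep with one big matmul, then keep only the hidden-to-hidden transform in the loop.

```python
import torch

seq_len, batch, input_size, hidden_size = 35, 20, 200, 200
x = torch.randn(seq_len, batch, input_size)
h = torch.zeros(batch, hidden_size)

w_ih = torch.randn(hidden_size, input_size) * 0.1  # input-to-hidden weights
w_hh = torch.randn(hidden_size, hidden_size) * 0.1  # hidden-to-hidden weights
b_ih = torch.zeros(hidden_size)
b_hh = torch.zeros(hidden_size)

# One batched matmul covering every timestep at once.
input_proj = x @ w_ih.t() + b_ih            # (seq_len, batch, hidden_size)

outputs = []
for t in range(seq_len):
    # Only the hidden-to-hidden weights are applied per step.
    h = torch.tanh(input_proj[t] + h @ w_hh.t() + b_hh)
    outputs.append(h)
output = torch.stack(outputs)                # (seq_len, batch, hidden_size)
```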
