How to implement gradient penalty in PyTorch

You are right, most function are still old-style which don’t support grad of grad.
There is a temporary fix: use difference rather than differential

x_1,x_2 are sampled from x_hat
idea from 郑华滨