How to implement gradient penalty in PyTorch

(Yun Chen) #1

I was reading Improved Training of Wasserstein GANs and wondering how it could be implemented in PyTorch. It does not seem too complex, but I am unsure how to handle the gradient penalty term in the loss.

In the TensorFlow implementation, the authors use tf.gradients.

I wonder if there is an easy way to handle the gradient penalty.
Here is my idea for implementing it. I don’t know whether it will work, or whether it will work the way I think:

optimizer_D = optim.Adam(model_D.parameters())
x = Variable()
y = Variable()
x_hat = (alpha*x+(1-alpha)*y).detach()
x_hat.requires_grad = True

loss_D = model_D(x_hat).sum()
x_hat.grad.volatile = False

loss = model_D(x).sum() - model_D(y).sum() + ((x_hat.grad -1)**2 * LAMBDA).sum()

How can I optimise with respect to the gradient? Something like grad.backward()?
(Francisco Massa) #2

For the moment, it’s not yet possible to have gradients of gradients in PyTorch, but there is a pending PR that will implement that and should be merged soon.

(Yun Chen) #3

Great! Can’t wait for that.


@chenyuntc, did you figure out a way of doing this using PyTorch? Thanks.

(Yun Chen) #5

torch.autograd.grad would help.
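
For readers who want to see what this looks like, here is a minimal sketch of the penalty built on torch.autograd.grad. It assumes a PyTorch version in which every layer of the critic supports double backward. The names `gradient_penalty`, `critic`, and `lam`, and the small MLP critic, are illustrative choices and not from the thread or the paper's code.

```python
import torch
import torch.nn as nn

def gradient_penalty(critic, real, fake, lam=10.0):
    """Sketch of lam * E[(||grad_x_hat D(x_hat)||_2 - 1)^2] (hypothetical helper)."""
    batch = real.size(0)
    # One interpolation coefficient per sample, broadcast over the other dims.
    alpha = torch.rand(batch, *([1] * (real.dim() - 1)), device=real.device)
    x_hat = (alpha * real.detach() + (1 - alpha) * fake.detach()).requires_grad_(True)
    scores = critic(x_hat)
    # create_graph=True records the graph of the gradient itself, so the
    # penalty can later be backpropagated to the critic's parameters.
    (grads,) = torch.autograd.grad(
        outputs=scores.sum(), inputs=x_hat, create_graph=True
    )
    grad_norm = grads.reshape(batch, -1).norm(2, dim=1)
    return lam * ((grad_norm - 1) ** 2).mean()

torch.manual_seed(0)
# Toy critic and random data, used only to exercise the function.
critic = nn.Sequential(nn.Linear(2, 16), nn.Tanh(), nn.Linear(16, 1))
real, fake = torch.randn(8, 2), torch.randn(8, 2)

# Critic loss: D(fake) - D(real) + penalty
loss = critic(fake).mean() - critic(real).mean() + gradient_penalty(critic, real, fake)
loss.backward()
```

The key design point is `create_graph=True`: without it the returned gradient is detached from the graph and the penalty would contribute nothing to the parameter gradients.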


Thank you. I had a look at this, but from what I can see PyTorch does not yet support higher-order derivatives of the non-linear functions used in the DCGAN model. Or am I wrong?

(Yun Chen) #7

You are right: most functions are still old-style and don’t support grad of grad.
There is a temporary fix: use a finite difference rather than the derivative.

Here x_1 and x_2 are sampled from x_hat.
The idea is from 郑华滨.
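
The formula itself did not survive in this copy of the thread, so the following is only a sketch of one plausible reading of "difference rather than differential": penalise the difference quotient |D(x_1) - D(x_2)| / ||x_1 - x_2|| for deviating from 1. The function names, `lam`, and `eps` are assumptions made for illustration. Only first-order autograd is needed, which is the point of the workaround.

```python
import torch

def sample_x_hat(real, fake):
    """Random point on the segment between each real/fake pair (hypothetical helper)."""
    alpha = torch.rand(real.size(0), *([1] * (real.dim() - 1)))
    return alpha * real + (1 - alpha) * fake

def difference_penalty(critic, x1, x2, lam=10.0, eps=1e-12):
    """Assumed form: lam * E[(|D(x1) - D(x2)| / ||x1 - x2|| - 1)^2]."""
    batch = x1.size(0)
    dist = (x1 - x2).reshape(batch, -1).norm(2, dim=1)
    diff = (critic(x1) - critic(x2)).reshape(batch).abs()
    slope = diff / (dist + eps)  # eps guards against identical samples
    return lam * ((slope - 1) ** 2).mean()

torch.manual_seed(0)
critic = torch.nn.Sequential(
    torch.nn.Linear(2, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1)
)
real, fake = torch.randn(8, 2), torch.randn(8, 2)
# Two independent draws of x_hat play the roles of x_1 and x_2.
x1, x2 = sample_x_hat(real, fake), sample_x_hat(real, fake)
penalty = difference_penalty(critic, x1, x2)
penalty.backward()  # ordinary first-order backward pass
```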

(Yong Lian Hii) #8

I have been struggling with this as well. Could you provide an example of how it can be used?

(Thomas V) #9

Ajay and I discussed that a bit a while ago, and there is a link to a blog post and a Jupyter notebook that work through the toy examples from the Improved Training article in PyTorch:

Best regards


(Yun Chen) #10

@caogang is working on it, looking forward to that.

(Marvin Cao) #11

I have finished the toy dataset. You can refer to the implementation.

(Marvin Cao) #12

gan_toy is finished, and I am now working on gan_language. I hope it will be helpful.

(Yuanzheng Ci) #13

The idea seems more likely to come from Thomas’s Semi-Improved Training of Wasserstein GANs, or is it just a coincidence?

(Thomas V) #15

Hi @orashi,

thank you for the credit. I might be among the first to discuss this in detail in this specific context and with a PyTorch implementation. However, identifying 1-Lipschitz functions (in the classical definition) with the unit sphere in $W_{1,\infty}$ in the Sobolev scale (which is fancy mathematician talk for the gradient being bounded by 1) is very standard, as is approximating the derivative by finite differences (one could fancy-talk that into a different norm, but let’s not). So I would expect many other people to have had the same idea independently, and I’d go for coincidence. (Also, sampling two points is a bit different from sampling the center and using the two side points, as I did.)
What I find particularly curious is why the authors of Improved Training chose a point-wise derivative test instead of testing the Lipschitz constant directly, but I have not asked them yet, so I don’t know.

Best regards


(Yun Chen) #16

I first found the idea on Zhihu (a Chinese Quora). The author seems to simply use the difference as an approximation of the derivative, but someone commented on the article that the difference is actually a better fit for the Kantorovich dual problem.
The blog post from @tom seems both more insightful and more intuitive. Excellent work!

(Thomas V) #17


Just a quick update:
As discussed in the SLOGAN blog post, the difference in the above equation is generally not the gradient, but a projection onto x_hat. Thus it would seem more prudent to use a one-sided penalty in this formulation.
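
A rough illustration of what "one-sided" could mean here, under the assumption that `slope` holds per-sample difference quotients |D(x_1) - D(x_2)| / ||x_1 - x_2||. A difference quotient approximates a directional derivative, which can be smaller than the gradient norm, so only slopes above 1 are penalised. The function names and `lam` are hypothetical.

```python
import torch

def two_sided_penalty(slope, lam=10.0):
    # Penalises any deviation of the slope from 1, in either direction.
    return lam * ((slope - 1) ** 2).mean()

def one_sided_penalty(slope, lam=10.0):
    # clamp(min=0) zeroes the term wherever slope <= 1, so only
    # violations of the Lipschitz bound are penalised.
    return lam * ((slope - 1).clamp(min=0) ** 2).mean()

slope = torch.tensor([0.5, 1.0, 2.0])
print(two_sided_penalty(slope).item())  # 0.5 and 2.0 both contribute
print(one_sided_penalty(slope).item())  # only 2.0 contributes
```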

Best regards