How to implement gradient penalty in PyTorch

I was reading Improved Training of Wasserstein GANs and thinking about how it could be implemented in PyTorch. It doesn't seem too complex, but I'm not sure how to handle the gradient penalty in the loss.

In the TensorFlow implementation, the authors use tf.gradients.

I wonder if there is an easy way to handle the gradient penalty.
Here is my idea for implementing it; I don't know whether it will work, or whether it works the way I think it does:


import torch
from torch import optim
from torch.autograd import Variable

optimizer_D = optim.Adam(model_D.parameters())

x = Variable(real_data)  # real samples (placeholder name)
y = Variable(fake_data)  # generated samples (placeholder name)

# random interpolate between real and fake; detach so x_hat is a leaf
x_hat = (alpha * x + (1 - alpha) * y).detach()
x_hat.requires_grad = True

# first backward pass just to populate x_hat.grad
model_D(x_hat).sum().backward()

# penalize deviation of the gradient norm from 1
grad_norm = x_hat.grad.view(x_hat.size(0), -1).norm(2, dim=1)
penalty = ((grad_norm - 1) ** 2).sum() * LAMBDA

optimizer_D.zero_grad()
loss = model_D(x).sum() - model_D(y).sum() + penalty
loss.backward()
optimizer_D.step()

For the moment, it’s not yet possible to have gradients of gradients in PyTorch, but there is a pending PR that will implement that and should be merged soon.

Great! Can’t wait for that.

@chenyuntc, did you figure out a way of doing this in PyTorch? Thanks

torch.autograd.grad would help.
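
For example, a minimal sketch of how it could be used once grad-of-grad support lands (model_D is the critic; real, fake, and LAMBDA are placeholder names, and the exact shapes are assumptions):

import torch
from torch.autograd import Variable

# interpolate between real and generated batches
alpha = torch.rand(real.size(0), 1).expand_as(real)
x_hat = Variable(alpha * real.data + (1 - alpha) * fake.data, requires_grad=True)

d_out = model_D(x_hat)
grads, = torch.autograd.grad(
    outputs=d_out,
    inputs=x_hat,
    grad_outputs=torch.ones(d_out.size()),
    create_graph=True,  # keep the graph so the penalty itself is differentiable
)
penalty = LAMBDA * ((grads.view(grads.size(0), -1).norm(2, dim=1) - 1) ** 2).mean()

The penalty would then just be added to the usual critic loss and backpropagated as one term.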

Thank you, I had a look into this, but from what I can see torch doesn't yet support higher-order derivatives of the non-linear functions present in the DCGAN model. Or am I wrong?

You are right, most functions are still old-style and don't support grad of grad.
There is a temporary fix: use a difference rather than the differential, i.e. penalize the difference quotient $\frac{|D(x_1) - D(x_2)|}{\lVert x_1 - x_2 \rVert}$ for deviating from 1, where x_1, x_2 are sampled the same way as x_hat.
Idea from 郑华滨.
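
A minimal sketch of this difference-based penalty (a two-sided squared version; x_1, x_2, model_D, and LAMBDA are placeholder names, and the exact form of the penalty is an assumption):

# x_1, x_2: two independent batches of interpolates, sampled like x_hat
diff = (model_D(x_1) - model_D(x_2)).view(-1).abs()      # |D(x_1) - D(x_2)|
dist = (x_1 - x_2).view(x_1.size(0), -1).norm(2, dim=1)  # ||x_1 - x_2||
penalty = LAMBDA * (diff / (dist + 1e-12) - 1).pow(2).mean()  # push quotient toward 1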

I have been struggling with this as well; could you provide an example of how it can be used?

Ajay and I discussed this a bit a while ago, and there is a link to a blog post and Jupyter notebook doing the toy examples from the Improved Training article in PyTorch:

Best regards

Thomas

@caogang is working on it, looking forward to that.

I have finished the toy dataset; you can refer to the implementation.

Now I am working on gan_language; gan_toy is finished. Hope it will be helpful.

Does the idea come from Thomas's Semi-Improved Training of Wasserstein GANs, or is it just a coincidence?

Hi @orashi,

thank you for the credit. I might be among the first to discuss this in detail in this specific context and with a PyTorch implementation. But the identification of 1-Lipschitz functions (in the classical definition) with the unit sphere in $W_{1,\infty}$ on the Sobolev scale (which is fancy mathematician talk for the gradient being bounded by 1) is very standard, as is the approximation of the derivative by finite differences (actually, one could fancy-talk that into a different norm too, but let's not). So I would expect many other people to have had the same idea independently, and I'd go for coincidence. (Also, sampling two points is a bit different from sampling the center and using the two side points, as I did.)
What struck me as particularly curious in this case is why the authors of Improved Training chose to do a point-wise derivative test instead of testing the Lipschitz constant directly, but I have not asked them yet, so I don’t know.
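
(To spell out the identification informally: for differentiable $D$ on a convex domain,

$$\operatorname{Lip}(D) \;=\; \sup_{x_1 \neq x_2} \frac{|D(x_1) - D(x_2)|}{\lVert x_1 - x_2 \rVert} \;=\; \sup_{x} \lVert \nabla D(x) \rVert,$$

so bounding the gradient by 1 everywhere is the same as $D$ being 1-Lipschitz, and the difference-based penalty approximates the quotient on the left at sampled pairs.)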

Best regards

Thomas

I first found the idea on Zhihu (the Chinese Quora). The author seems to simply use the difference as an approximation of the differential, but someone commented on the article that the difference is actually the better approach for the Kantorovich dual problem.
The blog from @tom seems both more insightful and more intuitive. Excellent work!

Hi,

just a quick update:
As discussed in the SLOGAN blog post, the difference quotient in the above equation is generally not the gradient norm, but only the projection of the gradient onto the direction between the sampled points. Since that projection can only underestimate the full gradient norm, it would seem more prudent to use a one-sided penalty in this formulation.
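
A minimal sketch of the one-sided variant (same placeholder names as in the sketch above; only quotients above 1 are penalized):

diff = (model_D(x_1) - model_D(x_2)).view(-1).abs()
dist = (x_1 - x_2).view(x_1.size(0), -1).norm(2, dim=1)
penalty = LAMBDA * (diff / (dist + 1e-12) - 1).clamp(min=0).pow(2).mean()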

Best regards

Thomas

Could someone please explain how you would find a gradient of a gradient mathematically?