Wasserstein loss layer/criterion

I’ve added a straightforward port of sinkhorn and sinkhorn_stabilized from Python Optimal Transport.
It may be worthwhile to revisit it when float64 GPU processing becomes less expensive, or if someone comes up with a better stabilization.
Note that with a regularisation of 1e-3 or 1e-4, the distance seems quite close to the EMD distance.
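To illustrate (this is not the actual port, just a minimal NumPy sketch of the plain, unstabilized iteration with a made-up toy problem), the entropic distance does approach the exact EMD as the regularization shrinks, and the sketch also shows why small regularizations are delicate: the kernel exp(-M/reg) underflows, which is what the log-domain stabilized variant works around.

```python
import numpy as np

def sinkhorn(a, b, M, reg, num_iters=1000):
    """Plain Sinkhorn iterations for entropically regularized OT.

    a, b : source and target histograms (each sums to 1)
    M    : ground cost matrix
    reg  : entropic regularization strength
    Returns the transport cost <T, M> of the resulting plan T.
    """
    K = np.exp(-M / reg)                 # Gibbs kernel; underflows for tiny reg
    u = np.ones_like(a)
    for _ in range(num_iters):
        v = b / (K.T @ u)                # match column marginals
        u = a / (K @ v)                  # match row marginals
    T = u[:, None] * K * v[None, :]      # transport plan
    return float(np.sum(T * M))

# Two histograms on a line with |i - j| ground cost; the exact EMD is 0.4
a = np.array([0.7, 0.3])
b = np.array([0.3, 0.7])
M = np.abs(np.arange(2)[:, None] - np.arange(2)[None, :]).astype(float)
print(sinkhorn(a, b, M, reg=0.05))       # ~0.4
```

With reg around 1e-3 the off-diagonal kernel entries underflow to exactly zero in float64 and the plain iteration silently returns the wrong answer, hence the need for sinkhorn_stabilized.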

Best regards



Hi @tom !!

thanks for adding the functions from the Python Optimal Transport library; they’re definitely going to save some time in testing your implementation, which looks fine to me.

I was wondering if you’re interested in applying your PyTorch Wasserstein loss layer code to reproducing the noisy-label example in appendix E of Learning with a Wasserstein Loss (Frogner, Zhang, et al.).

It’s a small toy problem, but it should be a nice stepping stone to test against before perhaps going on to tackle real datasets, such as the more complicated Flickr tag-prediction task that Frogner, Zhang, et al. apply their Wasserstein loss layer to.

Also, there’s a toy dataset in scikit-learn that might be an alternative to the noisy-label toy dataset that Frogner et al. use.


I’ve been doing some reading, and to be honest, of the many different papers I’ve read on applying optimal transport to machine learning, the Frogner et al. paper, Learning with a Wasserstein Loss, still seems to be the only one I vaguely understand (in particular appendices C & D) and think I could ultimately reproduce, and then perhaps apply.

It seems a good idea to initially work with one-hot labels, as that simplifies things considerably; see appendix C of Frogner et al. If I understand correctly, in that special case we don’t even need the Sinkhorn-Knopp algorithm, as there’s only one possible transport plan! I think that’s quite cute, and it should be a lot of fun to see working :smile:, given how simple it is.
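If I read that special case right, it can be sketched in a couple of lines (the cost matrix and histogram below are made up for illustration): when the target is a point mass at class j, every feasible plan must send all of the predicted mass to column j, so the plan is unique and the loss is just a linear function of the prediction.

```python
import numpy as np

# Hypothetical ground cost between 4 classes (|i - j| metric)
M = np.abs(np.arange(4)[:, None] - np.arange(4)[None, :]).astype(float)

h = np.array([0.1, 0.2, 0.6, 0.1])  # predicted histogram
j = 2                               # one-hot target class

# With a point-mass target, the only feasible plan sends each h_i to
# column j, so no Sinkhorn iterations are needed:
loss = float(h @ M[:, j])
print(loss)  # 0.1*2 + 0.2*1 + 0.6*0 + 0.1*1 = 0.5
```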

All the best,



Hi @AjayTalati,

thanks for the pointers! I will definitely look into implementing an application or two.
In the meantime, I jotted down a few thoughts regarding the Improved Training paper for your amusement while we’re waiting for the grad of grad to be merged. :slight_smile: Or maybe we find something to do with the original WGAN code, too.

Have good holidays,

best regards



Hi @tom,

Wow, I really liked the post! Your knowledge is very deep, thank you very much for sharing it!

Perhaps it’s possible to compute gradients of gradients by calling .backward twice? I could definitely be wrong, but here are a couple of posts that you might like,
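With today’s API (at the time of this thread, the double-backward support was still pending), the idea can be sketched like this, using torch.autograd.grad with create_graph so the first derivative is itself differentiable:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 3

# First derivative: keep the graph so we can differentiate again
(g,) = torch.autograd.grad(y, x, create_graph=True)
print(g)   # 3 * x**2 = 27

# Second derivative: differentiate the gradient itself
(gg,) = torch.autograd.grad(g, x)
print(gg)  # 6 * x = 18
```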


The toy examples in the improved GAN paper are cute, and it would be fun to implement them :smile:

I guessed you had a physics background, and we both work in credit/actuarial science too :smile: - seems like we are almost twins :smile:

I’m sure it would be possible to train an RNN or MLP to emulate many internal/proprietary credit/actuarial loss-distribution models, which involve costly repeated stochastic sampling. A conditional Wasserstein GAN seems perfect for that! Once a slow and detailed stochastic model has been built by the analyst, emulating it by training a conditional GAN on its output, and then sampling from the trained conditional GAN, seems like a good strategy for your industry? They should give roughly the same numbers.

I share your view that the accuracy is not so good now, but I guess in the future that will change. For the time being perhaps for reporting purposes, they should be adequate, (nobody reads those reports anyway :wink:).

I wish you the best holidays too my friend,

best regards,


Hi @AjayTalati,

thank you for the kind words. I do use some things at work, but mainly I do machine learning as a hobby.

Regarding Improved Training of Wasserstein GAN, I have implemented the toy examples of the article with PyTorch in a Jupyter notebook.
I also included a novel (to me) method that I call Semi-Improved Training of Wasserstein GAN, which checks the Lipschitz term directly for a pair (or a pair and an interpolate) of points instead of going through the gradient. In my limited experiments, this seems to work reasonably well. (Thank you @smth for pointing to your code! Maybe one could use the semi-improved version as a workaround for the feature request https://github.com/martinarjovsky/WassersteinGAN/issues/36.)
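The pairwise idea, as I read it, might be sketched like this (this is my own paraphrase, not the notebook code; the function name, the one-sided clamp, and the toy check are my choices): penalize the finite-difference Lipschitz quotient |D(x) - D(y)| / |x - y| on sampled pairs, so no second derivative is needed.

```python
import torch

def pairwise_lipschitz_penalty(netD, x, y, weight=10.0):
    """Estimate the Lipschitz quotient |D(x) - D(y)| / |x - y| from a
    sampled pair of batches and penalize quotients above 1, instead of
    differentiating the critic as WGAN-GP does."""
    dist = (x - y).view(x.size(0), -1).norm(2, dim=1)   # |x - y| per sample
    quot = (netD(x) - netD(y)).abs().view(-1) / (dist + 1e-12)
    return weight * ((quot - 1.0).clamp(min=0) ** 2).mean()

# Toy check with a 1-d "critic" of slope 2: the quotient is 2 everywhere,
# so the penalty is weight * (2 - 1)^2 = 10
netD = lambda t: 2 * t
x = torch.tensor([[1.0], [3.0]])
y = torch.tensor([[0.0], [1.0]])
print(pairwise_lipschitz_penalty(netD, x, y))  # ~tensor(10.)
```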

Best regards



Hello all,

I’m having trouble training the WGAN: for pretty much every problem I try it on, my outputs all converge to (what appears to be) the mean of the input.

For instance, if I train on a mixture of 8 Gaussians, my outputs get stuck at ~(0, 0). If I train on images, it just produces the mean of the images for every output.

I’m using Soumith’s code from above, and have tried all sorts of hyperparameter settings.

Is this type of behaviour common? Do I just need to train for longer?

Any help would be very appreciated! Thank you.

  • Jordan

UPDATE: I added the code, two example outputs (from the start and end of the run), and a plot of the error logs to GitHub: https://github.com/Jordan-Campbell/wgan.

Hi Thomas @tom,

sorry for the late reply :frowning:

I’m waiting for the autograd feature request to be implemented, so that it’s possible to simultaneously experiment with both your Semi-Improved Training, which I think is really innovative :smile: , and the Improved Training / grad penalty formulation.

While we’re waiting, I’m looking into Professor Forcing (Bengio et al.), which is less mathematically rigorous but seems to be quite an effective way to train RNNs using the GAN paradigm.

I think it’s possible to try to implement this by modifying Sean’s (@spro) seq2seq code,


in which he’s already got a good implementation of teacher forcing (thanks Sean :sunny:). The simplest experiment in the Bengio paper is a character-level language model on the Penn Treebank, so I guess the best place to start is @yunjey’s implementation. The paper says, though, that professor forcing takes about three times as long to train as teacher forcing, so I’ll probably wait until the weekend :smile: to fully try it out.

I’m guessing that combining the Wasserstein GAN and Professor Forcing is probably going to be quite a general and simple way of training RNNs (though it might not be that fast :blush:)?

Best regards,


Since the pull request was merged, does anyone have an example of how one can compute gradients of gradients in order to implement WGAN-GP?

Thomas, I wanted to say that your blog post is really, really cool. The geometric intuition is really nice, and your semi-improved GANs are nice :slight_smile:


Thank you @smth! It means a lot to me.


So the brand-new, awesome gradient-of-gradient feature (thanks!) is here, and one can try to implement the original Improved Training of Wasserstein GAN cost functional.

Thank you @apaszke for your advice on how to improve second-derivative handling in
my take on implementing it in PyTorch.
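For reference, the penalty from the paper can be sketched with the now-merged torch.autograd.grad like this (modern tensor API; the function and the toy check are mine, not necessarily what the notebook does):

```python
import torch

def gradient_penalty(netD, real, fake, weight=10.0):
    """Improved-Training penalty: push the critic's gradient norm at
    random interpolates between real and fake samples towards 1."""
    batch_size = real.size(0)
    # one uniform interpolation coefficient per sample, broadcastable
    alpha = torch.rand(batch_size, *([1] * (real.dim() - 1)))
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    # create_graph=True so the penalty itself can be backpropagated
    (grads,) = torch.autograd.grad(
        netD(interp).sum(), interp, create_graph=True)
    norms = grads.view(batch_size, -1).norm(2, dim=1)
    return weight * ((norms - 1.0) ** 2).mean()

# A linear "critic" of slope 3 has gradient norm 3 everywhere,
# so the penalty is weight * (3 - 1)^2 = 40
netD = lambda t: 3 * t.sum(dim=1, keepdim=True)
real = torch.tensor([[0.5], [1.5]])
fake = torch.tensor([[0.0], [1.0]])
print(gradient_penalty(netD, real, fake))  # tensor(40., grad_fn=...)
```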

Here is a picture of the 25 gaussians:

Best regards


(Edited after improving the loss function formulation.)


Great stuff Thomas :smile:

Maybe you can refer to the work here to see how the gradient penalty can be done in PyTorch:


Great stuff, @caogang! I look forward to comparing our toy implementations, and also to your implementation of the more serious problems.

Thank you for sharing!


Nice work @tom and @caogang !

Tom, I tried to plug your code into the Wasserstein code available from the original author; however, after calling grad(…, create_graph=True), all the Variables present in errD_gradient have requires_grad set to False, and consequently lip_loss.backward() fails since there are no nodes that require a gradient to be computed. Do you have any idea what may cause this?

Do you have a pointer to your code? Then I would take a look.
Basically, the section in the elif lipschitz_constraint == 3: branch should do the trick.

            interp_alpha.data.resize_(batch_size, 1)
            # interpolate between the two input batches; detaching via .data and
            # re-wrapping makes interp_points a new leaf with requires_grad=True
            interp_points = Variable(
                (interp_alpha.expand_as(input) * input
                 + (1 - interp_alpha.expand_as(input)) * input2).data,
                requires_grad=True)
            errD_interp_vec = netD(interp_points)
            # differentiate the critic output w.r.t. the interpolates, keeping
            # the graph so the penalty itself can be backpropagated
            errD_gradient, = torch.autograd.grad(
                errD_interp_vec.sum(), interp_points, create_graph=True)
            # squared gradient norm per sample, penalized towards 1
            lip_est = (errD_gradient ** 2).view(batch_size, -1).sum(1)
            lip_loss = penalty_weight * ((1.0 - lip_est) ** 2).mean(0).view(1)

Best regards


Thank you Thomas, your help is appreciated!

Yes, that is exactly what I plugged in, please see the code here: https://codeshare.io/5Mj9Vn

It is mainly the original code plus the bits written by you. Also, was the summation omitted intentionally? (errD = errD + lip_loss)


Hi Adrian,

I must admit that I cannot say why the error message is what it is, but I would guess that the network uses ops that have not yet been converted to the new-style autograd. These will come, but only piece by piece (I should really convert a few more, but I was busy today).
There is a pull request open by @caogang: https://github.com/pytorch/pytorch/pull/1507

Note that I reimplemented LeakyReLU, and I think BatchNorm is open, too.

Best regards


No worries, thanks for having a look! You are probably right that it’s caused by the missing ops.

Thanks a lot,

Has nn.Conv2d been converted to the new style autograd?