Wasserstein loss layer/criterion

Hi @AjayTalati,

I started an implementation here: https://github.com/t-vi/pytorch-tvmisc/blob/master/wasserstein-distance/Pytorch_Wasserstein.ipynb

I’m not terribly impressed by the numerical stability; I’ll have to look into that.

Best regards



Hi @tom

Wow !!!

That’s pretty amazing how quickly you did that !!! Very, very impressive.

Let me go over it and try to do some testing - I need to get my slow old brain working :slight_smile: !! It would be nice to do a test versus the unregularized version, i.e.

ot.lp.emd : Unregularized OT
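For quick sanity checks against the unregularized LP solution, the 1-D histogram case doesn’t even need a solver: W1 is the integral of the absolute CDF difference. A minimal numpy sketch (the helper name emd_1d is mine) that should agree with ot.lp.emd for histograms on a common sorted grid:

```python
import numpy as np

def emd_1d(a, b, x):
    """Exact (unregularized) W1 between two histograms a, b supported on
    the sorted 1-D grid x. For this special case the linear program that
    ot.lp.emd solves has the closed form W1 = integral |CDF_a - CDF_b|."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a, b = a / a.sum(), b / b.sum()           # normalize to probability vectors
    cdf_diff = np.cumsum(a - b)[:-1]          # CDF difference between grid points
    dx = np.diff(np.asarray(x, dtype=float))  # grid spacing
    return float(np.sum(np.abs(cdf_diff) * dx))
```

For instance, emd_1d([1, 0], [0, 1], [0.0, 1.0]) gives 1.0 - all the mass moves a distance of 1.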

So now that you have a Wasserstein loss that you can backprop through, maybe you want to train a plain vanilla GAN with it? It would be interesting to see how it compares. Here’s a quite basic version that’s new and converges quite fast,

It would be interesting if you could use your loss layer to improve it?

Perhaps you want to get in touch with Rémi Flamary, http://remi.flamary.com/, I’m sure he’ll be very impressed and be bursting with ideas for possible collaboration :smile:

Best regards,


Hi Tom,

Yes, as you said, I’m also getting a lot of “Warning: numerical errors” numerical-stability warnings.

Your implementation’s fine (thanks once again for trying it). I’m guessing the problem is simply the inherent instability of the different versions of the SK algorithm??

I’ve tried testing the linear-programming ot.emd against your implementation and the numpy functions ot.bregman.sinkhorn_epsilon_scaling(a, b, M, 1) etc. Like you showed, the stabilized algorithm is much more stable than the vanilla version, although the relative rankings are still a little off?

So practically, the task we want to address is not really whether we can reproduce the same values as the linear-program emd algorithm. Rather, we’re interested in ranking distributions.

To put it simply, if the linear-program emd algorithm ranks A and B closer than A and C, then any approximate algorithm (e.g. Sinkhorn-Knopp) should give the same relative ranking.
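That pairwise-ranking requirement is easy to make mechanical when testing - e.g. a small helper (name mine) that checks whether every pair of computed distances is ordered the same way by the exact and approximate algorithms:

```python
import numpy as np

def same_ranking(exact, approx):
    """Check the pairwise-ranking property: for every pair of distance
    values, the approximate algorithm (e.g. Sinkhorn-Knopp) orders them
    the same way the exact emd does."""
    exact, approx = np.asarray(exact, float), np.asarray(approx, float)
    sign_e = np.sign(exact[:, None] - exact[None, :])   # pairwise order, exact
    sign_a = np.sign(approx[:, None] - approx[None, :]) # pairwise order, approx
    return bool(np.all(sign_e == sign_a))
```

So same_ranking([1, 2, 3], [10, 20, 30]) holds, but same_ranking([1, 2, 3], [10, 30, 20]) does not - even though the second approximation might be numerically closer on average.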

I’m also trying it with discrete distributions, i.e. the histograms used in the original emd paper, but it works much better with Gaussians.

All the best,


I’ve added a straightforward port of sinkhorn and sinkhorn_stabilized from Python Optimal Transport.
It may be worthwhile to revisit it when float64 GPU processing becomes less expensive. Or one has to come up with a better stabilization.
Note that with a regularisation of 1e-3 or 1e-4, the distance seems quite close to the emd distance.
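In case it helps with testing, this is the kind of minimal (unstabilized) Sinkhorn-Knopp iteration being compared - a numpy sketch following the structure of ot.bregman.sinkhorn (the function name and n_iter default are my choices). It illustrates both why small regularisation approaches the emd value and where the instability comes from: the Gibbs kernel exp(-M/reg) underflows quickly.

```python
import numpy as np

def sinkhorn_dist(a, b, M, reg, n_iter=1000):
    """Minimal (unstabilized) Sinkhorn-Knopp: returns <T, M> for the
    entropic transport plan T between histograms a and b with ground
    cost matrix M. Small reg approaches the emd value, but the kernel
    K = exp(-M/reg) underflows -- the instability discussed above."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    M = np.asarray(M, dtype=float)
    K = np.exp(-M / reg)             # Gibbs kernel; tiny entries for small reg
    u = np.ones_like(a)
    for _ in range(n_iter):          # alternately rescale to match marginals
        v = b / (K.T @ u)
        u = a / (K @ v)
    T = u[:, None] * K * v[None, :]  # transport plan with marginals (a, b)
    return float(np.sum(T * M))
```

With a = (0.9, 0.1), b = (0.1, 0.9) and a 0/1 ground cost, reg = 0.01 lands very close to the exact emd value of 0.8, while much smaller reg can start producing nans.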

Best regards



Hi @tom !!

thanks for adding the functions from the Python Optimal Transport library, they’re definitely going to help save some time testing your implementation - which I think is fine.

I was wondering if you’re interested in applying your PyTorch Wasserstein loss layer code to reproducing the noisy label example in appendix E of Learning with a Wasserstein Loss, (Frogner, Zhang, et al).

It’s a small toy problem, but it should be a nice stepping stone to test against before perhaps going on to tackle real datasets - something like the more complicated Flickr tag prediction task that (Frogner, Zhang et al) apply their Wasserstein loss layer to?

Also, there’s a toy dataset in scikit-learn that might be an alternative to the noisy-label toy dataset that (Frogner et al) use?


I’ve been doing some reading and, to be honest, of the many different papers I’ve read on applying optimal transport to machine learning, the Frogner et al Learning with a Wasserstein Loss paper still seems to be the only one I vaguely understand (in particular appendices C & D), and the only one I think I could ultimately reproduce and then perhaps apply?

It seems a good idea to initially work with one-hot labels, as that seems to simplify things incredibly - see appendix C of Frogner et al. If I understand correctly, in that special case we don’t even need the SK algorithm, as there’s only one possible transport plan !!! I think that’s quite cute, and it should be a lot of fun to see working :smile:, given how simple it is?
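The one-hot special case can be written down directly: with a one-hot target at class j, every unit of predicted mass has to move to j, so the loss is just a dot product with a column of the ground-cost matrix. A hedged sketch (function name mine):

```python
import numpy as np

def wasserstein_onehot(h, j, M):
    """Wasserstein loss of prediction h against a one-hot target at
    class j (cf. appendix C of Frogner et al.): the only feasible
    transport plan moves all of h's mass to class j, so the loss is
    simply sum_i h[i] * M[i, j] -- no Sinkhorn iterations needed.
    M[i, j] is the ground cost of moving mass from class i to class j."""
    h = np.asarray(h, dtype=float)
    M = np.asarray(M, dtype=float)
    return float(h @ M[:, j])
```

With a 0/1 ground cost this reduces to the probability mass placed on the wrong classes, which makes the "cute" simplicity quite visible.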

All the best,



Hi @AjayTalati,

thanks for the pointers! I will definitely look into implementing an application or two.
In the meantime, I jotted down a few thoughts regarding the Improved Training paper for your amusement while we’re waiting for the grad of grad to be merged. :slight_smile: Or maybe we find something to do with the original WGAN code, too.

Have good holidays,

best regards



Hi @tom,

Wow, I really liked the post! Your knowledge is very deep, thank you very much for sharing it!

Perhaps it’s possible to compute gradients of gradients using .backward twice? I could definitely be wrong, but here are a couple of posts that you might like,
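For what it’s worth, once double backward is supported, the pattern looks like this in today’s tensor API (torch.autograd.grad with create_graph=True rather than calling .backward twice - this is a generic sketch, not code from the notebook):

```python
import torch

# Minimal check of "gradients of gradients": create_graph=True makes the
# first derivative itself part of the autograd graph, so it can be
# differentiated a second time.
x = torch.tensor(2.0, requires_grad=True)
y = x ** 3
g, = torch.autograd.grad(y, x, create_graph=True)  # dy/dx  = 3 * x**2
g2, = torch.autograd.grad(g, x)                    # d2y/dx2 = 6 * x
```

At x = 2 both g and g2 come out as 12 (3·2² and 6·2), which is an easy sanity check that the second derivative really went through the graph.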


The toy examples in the improved GAN paper are cute, and it would be fun to implement them :smile:

I guessed you had a physics background, and we both also work in credit/actuarial science :smile: - seems like we are almost twins :smile:

I’m sure that it would be possible to train an RNN or MLP to emulate many internal/proprietary credit/actuarial loss-distribution models, which involve costly repeated stochastic sampling. A conditional Wasserstein GAN seems perfect for that !!! Once a slow and detailed stochastic model is built by the analyst, emulating it by training a conditional GAN on its output, and then sampling from the trained conditional GAN, seems like a good strategy for your industry? They should give roughly the same numbers?

I share your view that the accuracy is not so good now, but I guess in the future that will change. For the time being perhaps for reporting purposes, they should be adequate, (nobody reads those reports anyway :wink:).

I wish you the best holidays too my friend,

best regards,


Hi @AjayTalati,

thank you for the kind words. I do use some things at work, but mainly I do Machine Learning as a hobby.

Regarding Improved Training of Wasserstein GAN, I have implemented the toy examples of the article with pytorch in a Jupyter notebook.
I also included a novel (to me) method that I call Semi-Improved Training of Wasserstein GAN that checks the Lipschitz term directly for a pair (or a pair and an interpolate) of points instead of referring to the gradient. In my limited experiments, this seems to work reasonably well. (Thank you @smth for pointing to your code! Maybe one could use the semi-improved version as a workaround for the feature request https://github.com/martinarjovsky/WassersteinGAN/issues/36 ).
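One plausible reading of that pairwise Lipschitz check, as a sketch (the function name, the penalty weight, and the one-sided clamp are my assumptions here, not necessarily what the notebook does - the notebook may well use a two-sided penalty instead):

```python
import torch

def semi_improved_penalty(netD, x1, x2, penalty_weight=10.0):
    """Estimate the Lipschitz quotient |D(x1) - D(x2)| / ||x1 - x2||
    directly from a pair of points and penalize any excess over 1,
    instead of differentiating the critic as in the gradient penalty."""
    d1, d2 = netD(x1), netD(x2)
    # per-sample distance between the paired points
    dist = (x1 - x2).view(x1.size(0), -1).norm(dim=1)
    lip_quot = (d1 - d2).abs().view(-1) / (dist + 1e-12)
    # one-sided penalty: only quotients above 1 are punished
    return penalty_weight * (torch.clamp(lip_quot - 1.0, min=0) ** 2).mean()
```

A critic with slope 2 along the pair direction gets penalized, while an at-most-1-Lipschitz critic contributes nothing - which is the behaviour the gradient penalty is also after, just measured on finite pairs.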

Best regards



Hello all,

I’m having trouble training the wgan - for pretty much every problem I try it on, my outputs all converge onto (what appears to be) the mean of the input.

For instance if I train on a series of 8 gaussians, then my outputs are stuck at ~ (0,0). If I train on images, then it just produces the mean of the images for every output.

I’m using Soumith’s code from above, and have tried all sorts of hyperparameter settings.

Is this type of behaviour common? Do I just need to train for longer?

Any help would be very appreciated! Thank you.

  • Jordan

UPDATE: I added the code, two example outputs (from the start and end of the run) and a plot of the error logs to github: https://github.com/Jordan-Campbell/wgan.

Hi Thomas @tom,

sorry for the late reply :frowning:

I’m waiting for the autograd feature request to be implemented, so that it’s possible to simultaneously experiment with both your Semi-Improved Training, which I think is really innovative :smile: , and the Improved Training / grad penalty formulation.

While we’re waiting, I’m looking into Professor Forcing (Bengio et al), which is less mathematically rigorous but seems to be quite an effective way to train RNNs using the GAN paradigm.

I think it’s possible to try to implement this by modifying Sean’s @spro seq2seq code,


in which he’s already got a good implementation of teacher forcing (thanks Sean :sunny: ). The simplest experiment in the Bengio paper is a character-level language model using Penn Treebank, so I guess the best place to start is @yunjey’s implementation. The paper says, though, that professor forcing takes 3x as long to train compared to teacher forcing, so I’ll probably wait until the weekend :smile: to fully try it out.

I’m guessing that combining the Wasserstein GAN and Professor Forcing is probably going to be quite a general and simple way of training RNNs (though it might not be that fast :blush: )?

Best regards,


Since the pull request was merged, does anyone have an example of how one can compute the gradients of gradients in order to implement WGAN-GP?

Thomas, I wanted to say that your blogpost is really really cool. The geometric intuition is really nice, and your semi-improved gans are nice :slight_smile:


Thank you @smth! It means a lot to me.


So the brand-new, awesome gradient of gradient feature (thanks!) is here and one could try to implement the original Improved training of Wasserstein GAN cost functional.

Thank you @apaszke for your advice on how to improve second derivative handling in
my take on implementing it in pytorch.
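The core gradient-penalty term of that cost functional can be sketched roughly like this (netD and the weight are placeholders - 10 is the paper’s default lambda - and this uses today’s tensor API, not the notebook’s exact code):

```python
import torch

def gradient_penalty(netD, real, fake, weight=10.0):
    """Sketch of the WGAN-GP penalty: push the norm of the critic's
    gradient at random interpolates between real and fake towards 1."""
    # one interpolation coefficient per sample, broadcast over features
    alpha = torch.rand(real.size(0), *([1] * (real.dim() - 1)))
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    d_interp = netD(interp)
    # create_graph=True so the penalty can itself be backpropagated
    grad, = torch.autograd.grad(d_interp.sum(), interp, create_graph=True)
    grad_norm = grad.view(real.size(0), -1).norm(dim=1)
    return weight * ((grad_norm - 1.0) ** 2).mean()
```

The returned tensor can simply be added to the critic loss before calling .backward(), since the second-derivative graph is kept alive by create_graph=True.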

Here is a picture of the 25 gaussians:

Best regards


(Edited after improving the loss function formulation.)


Great stuff Thomas :smile:

Maybe you can refer to the work here to see how to use the gradient penalty in pytorch


Great stuff, @caogang! I look forward to comparing our toy implementations and also to your implementation of the more serious problems.

Thank you for sharing!


Nice work @tom and @caogang !

Tom, I tried to plug your code into the Wasserstein code available from the original author; however, after calling grad(…, create_graph=True), all the Variables present in errD_gradient have requires_grad set to False, and consequently lip_loss.backward() fails since there are no nodes that require a gradient to be computed. Do you have any idea what may cause this?

Do you have a pointer to your code? Then I would take a look.
Basically, the section in the elif lipschitz_constraint == 3: branch should do the trick.

            # interpolate between the pair of points (input, input2)
            interp_alpha.data.resize_(batch_size, 1)
            alpha = interp_alpha.expand_as(input)
            interp_points = Variable((alpha * input + (1 - alpha) * input2).data,
                                     requires_grad=True)
            errD_interp_vec = netD(interp_points)
            # differentiate the critic w.r.t. the interpolates; create_graph=True
            # keeps the graph so lip_loss itself can be backpropagated
            errD_gradient, = torch.autograd.grad(errD_interp_vec.sum(), interp_points,
                                                 create_graph=True)
            # squared gradient norm per sample, pushed towards 1
            lip_est = (errD_gradient ** 2).view(batch_size, -1).sum(1)
            lip_loss = penalty_weight * ((1.0 - lip_est) ** 2).mean(0).view(1)

Best regards


Thank you Thomas, your help is appreciated!

Yes, that is exactly what I plugged in, please see the code here: https://codeshare.io/5Mj9Vn

It is mainly the original code plus the bits written by you. Also, was the summation omitted intentionally? ( errD = errD + lip_loss )