Wasserstein loss layer/criterion

(Thomas V) #43


yes, conv double backward was merged into master yesterday:
https://github.com/pytorch/pytorch/pull/1832

I think most of the other functions are there as well.
Thanks to those involved for all the hard work!

Best regards


(Jarrel Seah) #44

Is BatchNorm supported? I am getting an error that BatchNormBackward is not supported.

(Jarrel Seah) #45

Hi Tom,

I have been trying out your improvements to WGAN (thanks!) - for lipschitz_constraint == 1 in https://github.com/t-vi/pytorch-tvmisc/blob/master/wasserstein-distance/Improved_Training_of_Wasserstein_GAN.ipynb

Do you think taking the mean instead of the sum in this line: `dist = ((vinput - vinput2)**2).sum(1)**0.5` would work better for high-dimensional inputs, as the Euclidean norm (especially for images) quickly exceeds the order of magnitude of the discriminator outputs?

My initial experiments suggest that the mean does work for images, but the sum does not.
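For concreteness, here is a small numpy sketch (the dimension is made up for illustration, not taken from the notebook) of why the sum-based norm outgrows a mean-based one as the dimension rises:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 3072  # e.g. a flattened 3x32x32 image; illustrative only

x = rng.normal(size=D)
y = rng.normal(size=D)

# sum-based Euclidean norm, as in the notebook's dist line:
dist_sum = (((x - y) ** 2).sum()) ** 0.5
# mean-based variant: taking the mean inside the square root divides
# the distance by sqrt(D), normalizing away the dimension dependence
dist_mean = (((x - y) ** 2).mean()) ** 0.5

# for random inputs, dist_sum grows roughly like sqrt(D) while
# dist_mean stays O(1)
assert np.isclose(dist_mean, dist_sum / np.sqrt(D))
```

So the mean variant is just the sum variant rescaled by 1/sqrt(D), which keeps the distance on the same order of magnitude as typical critic outputs.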

(Csaba Botos) #46


Your code is very clear and helpful (at least for me).
Please help me understand this algorithm better: you backprop 1 for fake and -1 for real in this code:

While this notebook by @tom (which is enlightening as well) https://github.com/t-vi/pytorch-tvmisc/blob/master/wasserstein-distance/Semi-Improved_Training_of_Wasserstein_GAN.ipynb backprops -1 for fake and 1 for real.

Maybe I mistook some calculations, but if I’m correct, then we should decrease the output for real (since the EM distance measures how far we are from the real distribution, so for real samples we should decrease it) and increase the output for fake (where “output” is the output value of the discriminator/critic). This means your code is correct, yet I’m still confused, because @tom’s code seems to work too.

Or is this some kind of symmetric case where we only have to increase the distance between the real and fake discriminations, no matter in which direction?

Thank you :slight_smile:

(Marvin Cao) #47

Maybe you are right. In my opinion, @tom’s code regards the output of the discriminator as the error, while my code takes it as the Wasserstein distance. They are optimized in opposite gradient directions. I think both ways may lead the algorithm to converge, though I haven’t tested it with my current code. (Maybe I will test it in a few days.)

(Csaba Botos) #48

But for the sake of general understanding:
From the point of view of the critic, we want to decrease the Wasserstein distance of the critic’s implicit probability density with respect to the real distribution, while increasing the Wasserstein distance with respect to the **generator’s** implicit probability density. Am I right?

(Marvin Cao) #49

Yes, you are right. Explaining it this way is more understandable. :smiley:

(Thomas V) #50

Hi Csaba, Jarrel,

thank you for looking at this in detail!

I must admit that the mathematician in me cringes a bit at @botcs’s argument.
As @jarrelscy mentions, this is symmetric (it is a distance, after all).

What happens mathematically is that the discriminator (the test function in the supremum) will ideally converge to the negative of what you get when you switch the signs between real and fake. The only important thing is to have opposing signs between the pair (real discr, fake discr) in the discriminator loss and between (fake discr, fake gen) in the generator loss; the latter because we want to maximize the difference between the integrals in the discriminator, but do so by minimizing the negative.

So, approximately (if the penalty term were zero because its weight was infinite), the Wasserstein distance is the negative of the discriminator loss; the generator loss lacks the subtraction of the integral over the real samples to be the true Wasserstein distance, but as that term does not enter the gradient anyway, it is not computed. This is independent of how you pick the signs.
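To make the sign symmetry concrete, a tiny numeric sketch (the critic outputs are made-up numbers, and the loss names are mine, not from the notebooks):

```python
import numpy as np

# Toy critic outputs on a batch of real and fake samples (made-up
# numbers, just to illustrate the sign symmetry).
d_real = np.array([0.8, 1.1, 0.9])
d_fake = np.array([-0.2, 0.1, 0.0])

# Convention A: the critic minimizes mean(D(fake)) - mean(D(real)),
# and the generator minimizes -mean(D(fake)).
critic_loss_a = d_fake.mean() - d_real.mean()
gen_loss_a = -d_fake.mean()

# Convention B (signs flipped): the critic minimizes
# mean(D(real)) - mean(D(fake)), the generator minimizes mean(D(fake)).
critic_loss_b = d_real.mean() - d_fake.mean()
gen_loss_b = d_fake.mean()

# The two conventions are exact negatives of each other: a critic
# trained under B ideally converges to the negative of one under A.
assert np.isclose(critic_loss_a, -critic_loss_b)
assert np.isclose(gen_loss_a, -gen_loss_b)
```

Either convention works, as long as the critic and generator losses use opposing signs on the fake term, as described above.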

I took the signs from the WGAN code published by Martin Arjovsky on GitHub.

Note that some time after writing the notebook you linked, I came to the conclusion that the one-sided loss is better.

Best regards


(Thomas V) #51

Hi Jarrel,

Thank you for the observation.

Between this and the magnitude of the penalty parameter, it should be equivalent to move it. What you are effectively doing is changing the metric on the image space (from the Euclidean distance to the Euclidean distance divided by the number of pixels).

That said, it seems good not to have a penalty term that is overly large compared to the “primary” terms. My impression is that with the one-sided penalty, larger weights are much less problematic, as witnessed by the toy problems.
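To illustrate the difference between the two penalties, a minimal sketch, assuming the penalty acts on a Lipschitz ratio such as |D(x) - D(y)| / dist(x, y); the function names are mine, not from the notebooks:

```python
import numpy as np

def two_sided_penalty(lip_ratio):
    # standard WGAN-GP-style penalty: push the Lipschitz ratio
    # toward exactly 1, punishing both too-steep and too-flat critics
    return (lip_ratio - 1.0) ** 2

def one_sided_penalty(lip_ratio):
    # only penalize ratios above 1; being "too flat" is allowed,
    # which is why large penalty weights are less harmful here
    return np.clip(lip_ratio - 1.0, 0.0, None) ** 2

ratios = np.array([0.5, 1.0, 1.5])
assert np.allclose(two_sided_penalty(ratios), [0.25, 0.0, 0.25])
assert np.allclose(one_sided_penalty(ratios), [0.0, 0.0, 0.25])
```

With the one-sided version, a large weight only forces the critic to stay 1-Lipschitz; it does not push it toward having gradient norm exactly 1 everywhere.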

Best regards


(YangFangshu) #52

Hi Tom,

I am very impressed by your code: https://github.com/t-vi/pytorch-tvmisc/blob/master/wasserstein-distance/Pytorch_Wasserstein.ipynb. You have shown an example with 1D data, but I would like to know how to calculate the Wasserstein loss for a 2D training dataset. How do you compute the distance between two 2D matrices? Or 3D matrices?

It would be interesting if you could show an example with 2D data.

Best regards,

F.S. Yang

(Thomas V) #53


great question!
It should work for more dimensions in the sense that you just need to plug in the right distance function. The number of bins could be a limitation, though: with 100 data points you only get a 10x10 grid.
I should include a sample, really.
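As a rough sketch of what “plugging in the right distance function” might look like for the 2D case: bin the data into an n-by-n histogram and build the ground-distance matrix between cell centers, then hand the flattened histograms and this matrix to the solver. The grid size and construction here are my assumptions, not from the notebook:

```python
import numpy as np

# 10x10 grid, matching the "100 datapoints" example above
n = 10

# coordinates of the n*n cell centers, as rows of a (n*n, 2) array
centers = np.stack(
    np.meshgrid(np.arange(n), np.arange(n), indexing="ij"), axis=-1
).reshape(-1, 2)

# (n*n) x (n*n) ground-distance matrix: Euclidean distance between
# every pair of cell centers; this replaces the 1D |i - j| distances
cost = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)

assert cost.shape == (n * n, n * n)
assert np.allclose(cost, cost.T)  # a metric cost is symmetric
assert cost[0, 1] == 1.0          # adjacent cells are distance 1 apart
```

The Wasserstein computation itself then operates on the flattened 100-dimensional histograms with this cost matrix, exactly as in the 1D case.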

Best regards


(YangFangshu) #54

Hi tom,

Thank you for your reply. In computer vision we often process 2D images, and I find that computing the Wasserstein loss between two 2D matrices iteratively is computationally very expensive. How can we deal with this problem? Should we downsample the prediction and the target? And would the loss between the downsampled matrices still be reliable for training a network?

I hope you will show us an example with 2D data.

Best regards,

F.S. Yang
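A quick back-of-the-envelope (my own numbers, not from the thread) on why the 2D case gets expensive and how much downsampling buys:

```python
# Treating an HxW image as a histogram over its pixels, a dense
# ground-cost matrix has (H*W)^2 entries, so memory and compute blow
# up with the fourth power of the side length.
def cost_matrix_entries(h, w):
    return (h * w) ** 2

assert cost_matrix_entries(32, 32) == 1_048_576        # ~1e6 entries
assert cost_matrix_entries(256, 256) == 4_294_967_296  # ~4e9 entries

# downsampling both sides by a factor k shrinks the matrix by k**4:
# going from 256x256 down to 64x64 saves a factor of 4**4 = 256
assert cost_matrix_entries(256, 256) // cost_matrix_entries(64, 64) == 256
```

This is why downsampling (or sparse/entropic approximations of the transport problem) is usually needed before applying a dense Wasserstein solver to full-resolution images.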

(Avhirup Chakraborty) #55

I face the same problem.