Wasserstein loss layer/criterion

AjayTalati · March 22, 2017, 4:04pm

Hello

Are there any plans for an (approximate) Wasserstein loss layer to be implemented - or maybe its already out there?

It’s been in mocha for quite a while,

pluskid/Mocha.jl/blob/master/src/layers/wasserstein-loss.jl

@defstruct WassersteinLossLayer Layer (
  name :: AbstractString = "wasserstein-loss",
  (M :: Matrix = [], !isempty(M)),
  (lambda :: AbstractFloat = 0, lambda > 0),
  (sinkhorn_iter :: Int = 50, sinkhorn_iter > 0),
  (bottoms :: Vector{Symbol} = Symbol[], length(bottoms) == 2)
)
@characterize_layer(WassersteinLossLayer,
  has_loss  => true,
  can_do_bp => true,
  is_sink   => true,
)

type WassersteinLossLayerState{T} <: LayerState
  layer :: WassersteinLossLayer
  loss  :: T

  K     :: Blob
  KM    :: Blob
  alpha :: Blob

This file has been truncated. show original

The theory’s and implementation is a little bit beyond my superficial understanding, (Appendix D), but it seems quite impressive!

AjayTalati · March 22, 2017, 6:23pm

Yes I think that’s their particular application in the paper - but it could be more general than that?

The layer wasserstein-loss.jl coded in Mocha, seems to be fairly flexible.

You might have seen this Python optimal transport library,

github.com

rflamary/POT/blob/master/ot/bregman.py

# -*- coding: utf-8 -*-
"""
Bregman projections for regularized OT
"""

# Author: Remi Flamary <remi.flamary@unice.fr>
#         Nicolas Courty <ncourty@irisa.fr>
#
# License: MIT License

import numpy as np


def sinkhorn(a, b, M, reg, method='sinkhorn', numItermax=1000, stopThr=1e-9, verbose=False, log=False, **kwargs):
    u"""
    Solve the entropic regularization optimal transport problem and return the OT matrix

    The function solves the following optimization problem:

    .. math::

This file has been truncated. show original

I’m guessing that’s what they’ve implemented as their layer? Seems to be a solid piece of theory?

tom · March 22, 2017, 7:07pm

From what I understand, the POT library solves 4.1 (Entropic regularization of the Wasserstein distance, say W(p,q) ), deriving the gradient in 4.2 and the relaxation in 4.3 (first going to W(p_approx,q_approx)+DKL(p_approx,p)+DKL(q_approx,q) and then generalising DKL to allow p/q approx to not be distributions seems to go beyond that.

Funny that they use the difference between MNIST class labels as a metric for the target. That reminded me of your regression approach.

AjayTalati · March 22, 2017, 7:56pm

To be honest, I’m not too sure how to use the POT library yet - but if you want to play around in Mocha, here’s the test of the Wasserstein layer,

github.com

pluskid/Mocha.jl/blob/master/examples/test-wasserstein.jl

data_type = Float32

using Mocha
srand(123)

N = 1000
d = 4
lambda = 0.1
sinkhorn_iter = 10

n_hidden_layer = 2
n_hidden_unit = 64

test_wasserstein = true

X = rand(data_type, d, N)
Y_onehot = zeros(data_type, 1, N)
Y = zeros(data_type, d, N)
for i = 1:N
  Y[indmax(X[:,i]),i] = 1

This file has been truncated. show original

and just for the sake of completness, here’s the code to go with the original paper “Sinkhorn Scaling for Optimal Transport”

http://marcocuturi.net/SI.html

That’s probably the best place to start! Hopefully I’ll be able to make some sense of it all soon?

tom · March 22, 2017, 10:03pm

Hi.
The .jl shows

Performance on Validation Set after 10000 iterations
Accuracy (avg over 1000) = 95.9%"

or so?

float32 does not seem to provide the precision necessary to implement unmodified sinkhorn algorithm, at least in the Python Optimal Transport’s 1-d-OT example.

The log-stabilized sinkhorn algorithm seems to work better at first sight. In the example, the difference between unregularized and regularized is ~1e6 and the difference between numpy@float64 and pytorch@float32 is ~1e-7. Interestingly, the mocha code seems to implement the unstabilized algorithm (unless they are doing the stabilization elsewhere).

I’ll see about making it into a layer on another day.

AjayTalati · March 22, 2017, 10:21pm

Wow, that was quick!

Yes I get about the same,

22-Mar 21:36:03:INFO:root:## Performance on Validation Set after 10000 iterations
22-Mar 21:36:03:INFO:root:---------------------------------------------------------
22-Mar 21:36:03:INFO:root:  Accuracy (avg over 1000) = 95.8000%
22-Mar 21:36:03:INFO:root:---------------------------------------------------------

I’m still going over Marco Cuturi’s Matlab code - implementing a Wasserstein layers quite a long term project - unless you’ve got an immediate application in mind?

For the time being I’m content with just understand it mathematically.

It does seem to have a lot of potential if you want to train a network to give fast approximations to - existing slow simulation algorithms, or algorithms that currently calculate a Wasserstein metric using a linear program - there’s a lot of things like this in scientific computing - that’s actually the application I’ve got in mind.

Chiyuan Zhang (pluskid) has already applied this to computationally intensive geological simulations,

http://pluskid.org/papers/TLE2017-seismic.pdf

So basically - once your nets trained - you can run it forward in batch mode, and get output in second, which previously would have taken hours. Here’s a quote from the peper,

Recent work (Zhang et al., 2014; Frogner et al., 2015;
Dahlke et al., 2016) demonstrates a new approach that builds
and uses a deep neural network (DNN) statistical model to
transform raw-input seismic data directly to the final mapping
of faults in 2D. In this case, fault locations were chosen as the
output because of their relevance to optimizing production in
existing fields. DNNs are built on the premise that they can
be used to replicate any function (in theory, even a nonlinear
one like acoustic wave propagation). We show that DNNs can
be used to identify fault structure in 3D volumes with reasonable
accuracy. The greater promise is that as computational tools
improve, we can use even more complex neural networks to
improve accuracy. The costs are all computational, mostly in
the form of training incurred only once up front. Once the
neural network is trained, predictions can be produced in a
fraction of the training time and of the time needed for traditional
physics modeling with partial differential equations.
This means that nearly instantaneous earth models could be
created as acquisition progresses.

AjayTalati · March 23, 2017, 11:04am

Hi @tom, I just saw one of Marco Cuturi’s recent papers, and if I understood correctly, it gives a method to calculate the divergence between distributions using something like SGD, rather than Sinkhorn’s algorithm

That’s much more understandable, but I’m not sure how easy it is to implement? Basically, it would involve constructing a layer which itself would involve a sgd loop! Pretty funky

I know that there’s already a leanrable quadratic programming layer that’s been implemented, https://github.com/locuslab/qpth but this seems more general than that

Anyway here’s the link, "Stochastic Optimization for Large-scale Optimal Transport "

https://arxiv.org/abs/1605.08527

AjayTalati · March 23, 2017, 1:48pm

I found the code for “Stochastic Optimization for Large-scale Optimal Transport”

https://github.com/audeg/StochasticOT

It doesn’t look too confusing , I think I’m starting to understand what’s going on!

smth · March 23, 2017, 2:22pm

you can optimize divergence between distributions using the Wasserstein metric using a reformulated GAN.
Here is our paper on that:

and the corresponding code (which is awfully simple):

https://github.com/martinarjovsky/WassersteinGAN

AjayTalati · March 23, 2017, 2:44pm

Thanks @smth - seems like there’s quite a few ways of doing the same thing?

Exact calculation using the pyemd module - which is really slow because it does a linear program
i ) If I understand correctly, the wasserstein.jl layer in Mocha uses Sinkhorn’s algorithm to approximate the Wasserstein distance
ii) The code in the repo above which uses something similar to sgd,
iii) Your code which I need to have a good look at.

They should all roughly give the same value!

It should be pretty simple to do the test between the different methods - here’s the example code from case 0)

>>> from pyemd import emd
>>> import numpy as np
>>> first_signature = np.array([0.0, 1.0])
>>> second_signature = np.array([5.0, 3.0])
>>> distance_matrix = np.array([[0.0, 0.5], [0.5, 0.0]])
>>> emd(first_signature, second_signature, distance_matrix)
3.5

tom · March 23, 2017, 7:11pm

Thanks you, @AjayTalati and @smth for the links. (I had seen the Wasserstein GAN paper and code, but did not dig into the code much yet). The Sinkhorn algorithm is iterative, too, but as Genevay et al point out, its “batch” nature may make it prohibitive for large scale applications.

Python Optimal Transport also has an exact solver and compares to the entropy regularized version here: https://github.com/rflamary/POT/blob/master/examples/Demo_1D_OT.ipynb.
(Though in 1D you can just use the CDF_1^(-1)(CDF(x)) transport plan.)

In some applications Wasserstein distances beyond W_1=EMD might be interesting (e.g. the W_2 inner product might be handy sometimes).

The Wasserstein GAN paper and method is awesome, but I am not quite certain that the GAN distance does actually approximate the W_1 distance:

As the authors point out there is the issue whether the supremum is actually attained in the test set of the maximization, (not sure how that compares with the discretization you have to do before using Sinkhorn etc., the linked paper Genevay et al paper kernelizes for the continuous case).
The other thing about the Wasserstein GAN is that the maximizer of equation (3), even if the maximizer of equation (2) is contained in W, will not be a scaled version of the latter. This is not terribly relevant to the ends of the article as you still get the a good norm, so the authors do well to only briefly mention it. On the other hand it would be relevant if you use the W_1 distance for something where you need the W_1 distance itself, so you would need to compute the Lipschitz constant in the maximization procedure and divide by it in the quantity maximized in (3), i.e. for the candidates and not just the maximum.

Well, I better stop the talking get the code in shape to share.

AjayTalati · March 23, 2017, 8:34pm

Hi @tom, it’s really cool that you’re getting interested in this problem. I just want to avoid you going down blind alley’s.

The paper "Stochastic Optimization for Large-scale Optimal Transport " https://arxiv.org/abs/1605.08527, is a conference paper, and they’re usually a bit of a wild card. I spent a while on the code, audeg/StochasticOT, and it’s probably more suited to information retrieval rather than actually training a network like @smth’s GAN code, or as a layer in one. So I think I made a mistake, and it’s perhaps not such a good idea implementing it as a layer.

I don’t want to mislead you - it’s probably a good idea to work on something that’s been proven to be useful. If you simply try to reproduce Chiyuan Zhang (pluskid) Wassertein.jl layer, in the code at the top of this thread, that would be a safe thing to do. That’s something that were reasonably sure work’s, and if you get it working in PyTorch it would be something that you could reference to, and reuse in the future.

Anyway, starting with the easy stuff - at the moment I can’t get the entropy regularised version in,

https://github.com/rflamary/POT/blob/master/examples/Demo_1D_OT.ipynb.

and the exact EMD solver used in PyEMD to give roughly the same number’s? What I mean is, (I think) the functions in the POT library return the transportation plan matrix T, and to get the actual wasserstein/EMD divergence, you calculate, the inner product,

<T,M> = trace(T'M)

Where M is the ground metric matrix. But I get different numbers from both libraries, so I’m not sure which is right? I know you need to tune the regularization parameter lambda, but it should be easy to do that using downhill simplex/Nelder Mead

I want to get these test cases to reconcile - that way we’ve got something to check against when we try to code things, and do more complicated stuff in PyTorch. Otherwise, it’s too easy to make a mistake, without something solid to test against.

AjayTalati · March 23, 2017, 9:18pm

OK, sorry about this @tom , it seems the transport matrices from both PyEMD and POT do roughtly match - do want to check this?


import numpy as np
import matplotlib.pylab as pl
import pyemd 
import ot
from ot.datasets import get_1D_gauss as gauss

#%% parameters

n=100 # nb bins

# bin positions
x=np.arange(n,dtype=np.float64)

# Gaussian distributions
a=gauss(n,m=20,s=5) # m= mean, s= std
b=gauss(n,m=60,s=10)

# loss matrix
M=ot.dist(x.reshape((n,1)),x.reshape((n,1)))
M/=M.max()

#%% EMD with PyEMD library

div, T_pyemd = pyemd.emd_with_flow(a,b,M)
T_pyemd = np.array(T_pyemd)

#%% EMD with POT library

T_pot = ot.emd(a,b,M)

#%% Sinkhorn with POT library

lambd=1e-3
T_sink = ot.sinkhorn(a,b,M,lambd)

#%% plot each of the transport matrices

ot.plot.plot1D_mat(a,b,T_pyemd,'Exact OT matrix using PyEMD')
pl.show()

ot.plot.plot1D_mat(a,b,T_pot,'Exact OT matrix using POT')
pl.show()

ot.plot.plot1D_mat(a,b,T_sink,'Sinkhorn OT matrix using POT')
pl.show()

tom · March 23, 2017, 9:58pm

To compute the distance, you need to integrate:

dist=(M*T).sum()

Something seems to be up with pyemd, though: If you use your example but with larger stddev (e.g. 10 and 30), the “optimal” plan pyemd emits here clearly is broken (as the monotone transport function gives the optimal plan and I see two distinct areas)…

AjayTalati · March 23, 2017, 10:43pm

Hey thanks a lot for that! I think you’ve found something!

I do think PyEMD is a bit finicky, but the distances seem to be roughly the same,

>>> #%% EMD with PyEMD library
... 
>>> div_pyemd, T_pyemd = pyemd.emd_with_flow(a,b,M)
>>> T_pyemd = np.array(T_pyemd)
>>> div_pyemd
0.16684170730400028
>>> 
>>> #%% EMD with POT library
... 
>>> T_pot = ot.emd(a,b,M)
>>> div_pot_emd = np.trace( np.matmul( T_pot.transpose() , M) )
>>> div_pot_emd
0.16580581101906636

I guess then, we should go with the POT library, as that’s probably more reliable?

AjayTalati · March 23, 2017, 10:50pm

On another note, I think their choice of lambda=1e-3 is reasonable, I tried to tune it but it did’nt really make much difference,


#%% Use the downhill simplex method to find lambd
 
import numpy as np
from scipy.optimize import minimize

T_pot = ot.emd(a,b,M)
div_pot_emd = np.trace( np.matmul( T_pot.transpose() , M) )

#objective functon
def tune_lambda(x):
    #%% Sinkhorn approximate EMD using POT library
    T_sink = ot.sinkhorn(a,b,M,x)
    div_pot_sink = np.trace( np.matmul( T_sink.transpose() , M) )
    error = abs(div_pot_emd - div_pot_sink)
    return error

x0 = 1e-3 # initial guess

res = minimize(tune_lambda, x0, method='nelder-mead', options={'xtol': 1e-8, 'disp': True})

print(res.x)

AjayTalati · March 23, 2017, 11:51pm

Some speed tests for you @tom, only x10 at the moment?

It should be faster when i) it’s run on the gpu, ii) the histograms get bigger, iii) increase the dimensions, i.e images or higher dimension tensors, iv) tune lambda

Check out figure 4, from Marco Cuturi’s original paper,

#-----------------------------------------------------------------------------------------
#%% Speed Tests - see https://github.com/rflamary/POT/blob/master/examples/plot_OT_1D.py
#-----------------------------------------------------------------------------------------

import numpy as np
import pyemd 
import ot
import functools
from time import time
from ot.datasets import get_1D_gauss as gauss

#@functools.lru_cache (maxsize=None) # hash(tuple( yadda, yadda
def pyemd_linprog(n,a,b,M):
    start = time()
    for i in range(n): 
        div_pyemd = pyemd.emd(a,b,M)
    run_time = time() - start
    print(run_time)

#@functools.lru_cache (maxsize=None)
def pot_sinkhorn(n,a,b,M,lambd):
    start = time()
    for i in range(n): 
        T_sink = ot.sinkhorn(a,b,M, lambd)
        #div_pot_sink = np.trace( np.matmul( T_sink.transpose() , M) )
    run_time = time() - start
    print(run_time)

#%% parameters

n=128 # nb bins

# bin positions
x=np.arange(n,dtype=np.float64)

# Gaussian distributions
a=gauss(n,m=20,s=5) # m= mean, s= std
b=gauss(n,m=60,s=10)

# loss matrix
M=ot.dist(x.reshape((n,1)),x.reshape((n,1)))
M/=M.max()

##--------
runs=1000
lambd = 1

>>> pyemd_linprog( runs, a, b, M )
21.02236819267273
>>> pot_sinkhorn( runs, a, b, M, lambd )
2.0207273960113525

tom · April 8, 2017, 7:18pm

Hi @AjayTalati,

I started an implementation here: https://github.com/t-vi/pytorch-tvmisc/blob/master/wasserstein-distance/Pytorch_Wasserstein.ipynb

I’m not terribly impressed by the numerical stability, I’ll have to look into that.

Best regards

Thomas

AjayTalati · April 8, 2017, 8:13pm

Hi @tom

Wow !!!

That’s pretty amazing how quickly you did that !!! Very, very impressive.

Let me go over it and try to do some testing - I need to get my slow old brain working !! Be nice to do a test verses the unregularized version, i.e.

ot.lp.emd : Unregularized OT

So now that you have a Wasserstein loss that you can backprop through maybe you want to train a plain vanilla GAN with it, it would be interesting to see how it compares, here’s a quite basic version that’s new, and converges quite fast,

github.com

AjayTalati/generative-models/blob/master/GAN/boundary_equilibrium_gan/began_pytorch.py

import torch
import torch.nn
import torch.nn.functional as nn
import torch.autograd as autograd
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import os
from torch.autograd import Variable
from tensorflow.examples.tutorials.mnist import input_data


mnist = input_data.read_data_sets('../../MNIST_data', one_hot=True)
mb_size = 32
z_dim = 10
X_dim = mnist.train.images.shape[1]
y_dim = mnist.train.labels.shape[1]
h_dim = 128
cnt = 0

This file has been truncated. show original

Be interesting if you could use your loss layer to improve it?

Perhaps you want to get in touch with Rémi Flamary, http://remi.flamary.com/, I’m sure he’ll be very impressed and be bursting with ideas for possible collaboration

Best regards,

Ajay

AjayTalati · April 9, 2017, 11:37am

Hi Tom,

Yes as you said, I’m also getting a lot of Warning: numerical errrors numerical stability warnings?

Your implementations fine, (thanks once again for trying it), I’m guessing the problems simply the inherent instability of the different versions of the SK algorithm??

I’ve tried testing the linear programming ot.emd, against your implementation, and the numpy functions ot.bregman.sinkhorn_epsilon_scaling(a,b,M,1) etc. Like you showed the stabilized algorithm is much more stable than the vanilla version, although the relative rankings are still a little off?

So practically the task we want to address is not really can be reproduce the same values as the linear program emd algorithm. Rather we’re interested in ranking distributions.

To put it simply, if the linear program emd algorithm ranks, A and B, closer than A and C, then any approximate algorithm, (eg Sinkhorn-Knopp), should also give the same relative ranking.

I’m also trying it with discrete distributions, i.e. the histograms used in the original emd paper, but it works much better with Gaussians.

All the best,

Ajay