Gradient Vanishing with Wasserstein distance

I’m training GoogLeNet with a simplified Wasserstein distance (also known as earth mover’s distance) as the loss function for a 100-class classification problem. Since the ground truth (gnd) is a one-hot distribution, the loss reduces to a weighted sum: each class’s softmax probability weighted by the absolute difference between that class index and the ground-truth class index.
L(p, y) = \sum_{i=0}^{99} p_i \, |i - y|

where p_i is the softmax output for class i and y is the ground-truth class index.

It is defined as follows:

import torch
import torch.nn as nn


class WassersteinClass(nn.Module):
    def __init__(self) -> None:
        super().__init__()

    def forward(self, likelihood, gnd_idx):
        # likelihood: (batch_size, l) softmax output; gnd_idx: (batch_size,) class indices
        batch_size = likelihood.shape[0]
        l = likelihood.shape[1]  # number of bins (100)

        gnd_idx = gnd_idx.reshape((batch_size, 1))
        # class indices 0..l-1, created on the same device as the input
        idxs = torch.arange(0, l, dtype=torch.float32, device=likelihood.device)
        batch_idxs = idxs.repeat(batch_size, 1)
        D = torch.abs(batch_idxs - gnd_idx)  # |i - y|, broadcast over the batch

        loss = torch.sum(likelihood * D)  # sum_i p_i * |i - y|
        return loss
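
For reference, this is roughly how the loss is used and how the gradients below were printed (a minimal sketch with a dummy logits tensor and a made-up ground-truth index of 60, not my actual training loop, which uses GoogLeNet and AMP):

import torch

criterion = WassersteinClass()

logits = torch.randn(1, 100, requires_grad=True)  # stand-in for the network output on one image
theta = torch.softmax(logits, dim=1)              # p_i, the softmax output
theta.retain_grad()                               # keep the gradient w.r.t. theta for printing

gnd_idx = torch.tensor([60])                      # hypothetical ground-truth class index
loss = criterion(theta, gnd_idx)
loss.backward()

print(theta)        # softmax output
print(logits.grad)  # "theta gradient before softmax": gradient w.r.t. the softmax input
print(theta.grad)   # "theta gradient after softmax": gradient w.r.t. theta, equals D = |i - gnd_idx|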

The gradient vanishes after several epochs. I printed the softmax output (theta), the gradient with respect to the softmax input, and the gradient with respect to theta itself.

theta tensor([[0.0100, 0.0099, 0.0100, 0.0100, 0.0097, 0.0100, 0.0104, 0.0101, 0.0099,
         0.0101, 0.0103, 0.0100, 0.0097, 0.0099, 0.0102, 0.0101, 0.0099, 0.0097,
         0.0098, 0.0098, 0.0098, 0.0097, 0.0101, 0.0101, 0.0103, 0.0099, 0.0098,
         0.0098, 0.0097, 0.0098, 0.0103, 0.0098, 0.0103, 0.0101, 0.0102, 0.0097,
         0.0101, 0.0103, 0.0101, 0.0103, 0.0102, 0.0101, 0.0099, 0.0099, 0.0099,
         0.0102, 0.0103, 0.0100, 0.0099, 0.0101, 0.0103, 0.0102, 0.0100, 0.0097,
         0.0102, 0.0101, 0.0097, 0.0100, 0.0103, 0.0099, 0.0100, 0.0101, 0.0100,
         0.0099, 0.0099, 0.0102, 0.0097, 0.0103, 0.0097, 0.0097, 0.0101, 0.0103,
         0.0099, 0.0097, 0.0099, 0.0102, 0.0099, 0.0097, 0.0098, 0.0097, 0.0098,
         0.0101, 0.0102, 0.0101, 0.0103, 0.0098, 0.0099, 0.0102, 0.0100, 0.0102,
         0.0101, 0.0100, 0.0104, 0.0100, 0.0101, 0.0102, 0.0099, 0.0098, 0.0099,
         0.0097]], device='cuda:0', grad_fn=<SoftmaxBackward0>)
theta gradient before softmax tensor([[ 0.3390,  0.3260,  0.3206,  0.3079,  0.2902,  0.2891,  0.2899,  0.2720,
          0.2572,  0.2509,  0.2461,  0.2291,  0.2130,  0.2068,  0.2030,  0.1915,
          0.1780,  0.1647,  0.1557,  0.1466,  0.1360,  0.1253,  0.1206,  0.1098,
          0.1025,  0.0881,  0.0772,  0.0678,  0.0575,  0.0479,  0.0404,  0.0286,
          0.0198,  0.0093, -0.0009, -0.0105, -0.0212, -0.0318, -0.0413, -0.0524,
         -0.0623, -0.0714, -0.0800, -0.0899, -0.1002, -0.1131, -0.1246, -0.1314,
         -0.1401, -0.1525, -0.1649, -0.1741, -0.1811, -0.1844, -0.2040, -0.2134,
         -0.2137, -0.2306, -0.2483, -0.2478, -0.2603, -0.2540, -0.2403, -0.2287,
         -0.2183, -0.2148, -0.1953, -0.1969, -0.1752, -0.1655, -0.1618, -0.1556,
         -0.1391, -0.1271, -0.1194, -0.1127, -0.0995, -0.0883, -0.0796, -0.0689,
         -0.0597, -0.0512, -0.0417, -0.0312, -0.0215, -0.0107, -0.0008,  0.0093,
          0.0191,  0.0296,  0.0394,  0.0492,  0.0613,  0.0689,  0.0797,  0.0910,
          0.0985,  0.1068,  0.1174,  0.1256]], device='cuda:0')
theta gradient after softmax tensor([[60., 59., 58., 57., 56., 55., 54., 53., 52., 51., 50., 49., 48., 47.,
         46., 45., 44., 43., 42., 41., 40., 39., 38., 37., 36., 35., 34., 33.,
         32., 31., 30., 29., 28., 27., 26., 25., 24., 23., 22., 21., 20., 19.,
         18., 17., 16., 15., 14., 13., 12., 11., 10.,  9.,  8.,  7.,  6.,  5.,
          4.,  3.,  2.,  1.,  0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.,
         10., 11., 12., 13., 14., 15., 16., 17., 18., 19., 20., 21., 22., 23.,
         24., 25., 26., 27., 28., 29., 30., 31., 32., 33., 34., 35., 36., 37.,
         38., 39.]], device='cuda:0')
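
For context, the “theta gradient after softmax” values above are exactly D = |i - gnd_idx| with gnd_idx = 60, which is what dL/dp_i should be for this loss. A quick check (assuming the same 100 bins):

import torch

D = torch.abs(torch.arange(100.0) - 60)  # |i - gnd_idx| for gnd_idx = 60
print(D)  # 60., 59., ..., 1., 0., 1., ..., 39.  -- matches the printed tensor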

What I don’t understand:

  1. Both the softmax input and output have dimension N, so the theta gradient before the softmax should be an N×N (Jacobian) matrix. Why is it a 1×N tensor here? (See the shape-check sketch after this list.)
  2. Does the gradient vanish because of the absolute value?
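
To illustrate point 1, here is a small shape check, separate from the training code: the full softmax Jacobian is N×N, but what backward() leaves on the input is a single vector of the input’s shape (the product of the upstream gradient with that Jacobian):

import torch

x = torch.randn(100, requires_grad=True)

# Full Jacobian of the softmax: an N x N matrix.
J = torch.autograd.functional.jacobian(lambda t: torch.softmax(t, dim=0), x)
print(J.shape)  # torch.Size([100, 100])

# The gradient backward() accumulates on x for this loss: a single N-vector
# (1 x N in my prints above, because the input there has a batch dimension).
p = torch.softmax(x, dim=0)
D = torch.abs(torch.arange(100.0) - 60)
loss = torch.sum(p * D)
loss.backward()
print(x.grad.shape)  # torch.Size([100])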

The model works fine with cross-entropy loss on the same images. Thanks for any advice!

Based on your code, you might be checking the scaled gradients. If you want to print the real gradients, you could unscale them first via the scaler object.
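
For reference, the usual GradScaler ordering looks roughly like this (a minimal sketch with a dummy linear model standing in for your setup); unscale_ goes after backward() and before step(), and after that the .grad attributes hold the unscaled values:

import torch

# Dummy placeholders so the snippet runs on its own; substitute your model, data, and loss.
model = torch.nn.Linear(10, 100).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()
loader = [(torch.randn(1, 10).cuda(), torch.tensor([60]).cuda())]

scaler = torch.cuda.amp.GradScaler()

for data, target in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        output = model(data)
        loss = criterion(output, target)

    scaler.scale(loss).backward()

    # Unscale the gradients held by the optimizer's assigned parameters
    # so that printing param.grad shows the real values.
    scaler.unscale_(optimizer)
    print(model.weight.grad.abs().max())

    scaler.step(optimizer)  # the step is skipped internally if infs/NaNs are found
    scaler.update()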

Thanks for the reply first!
I added scaler.unscale_(optimizer) after scaler.scale(loss).backward() and printed the gradient again. The values are still scaled by 65536.

Now I get it. When I remove the scaler, the gradient from backpropagation looks the same as the one from autograd. The gradient-vanishing problem still exists, though. I have updated the question.