Trying to backpropagate through svd

I get the following error after one forward pass through a network whose loss function is computed using torch.svd:

MAGMA gesdd : the updating process of SBDSDC did not converge

Does anyone have any idea what might be wrong?

Your matrix is probably so ill-conditioned that the underlying LAPACK operation couldn't converge. You can try catching that error, printing the input when it happens, and seeing what the matrix looks like exactly.
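A rough sketch of what that could look like (safe_svd is just a placeholder name, not from the original code):

import torch

def safe_svd(mat):
    try:
        return torch.svd(mat)
    except RuntimeError as e:  # MAGMA/LAPACK convergence failures surface as RuntimeError
        print("svd failed:", e)
        print("any NaN in input:", torch.isnan(mat).any().item())
        print("input matrix:")
        print(mat)
        raise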

Thanks. Actually, when I inspect the input to torch.svd in a try/except block, it has a lot of NaN values. I am unable to figure out what made the output of the network become NaN.

Can very similar singular values make the gradient through SVD NaN?
Below are the printed singular values for a single batch.

tensor([ 12.1675, 12.0520, 10.0973, 10.0002, 9.9813, 9.8853,
7.3679, 7.2976, 0.3496, 0.2729, 0.2605, 0.2010], device='cuda:0')
tensor([ 12.1267, 12.0520, 10.0581, 10.0000, 9.9442, 9.8853,
7.3371, 7.2973, 0.3157, 0.2456, 0.2269, 0.1650], device='cuda:0')
tensor([ 12.2977, 12.0520, 10.1992, 10.0880, 10.0000, 9.8853,
7.4365, 7.2974, 0.3493, 0.2985, 0.2740, 0.2152], device='cuda:0')
tensor([ 12.1671, 12.0520, 10.0945, 10.0000, 9.9820, 9.8853,
7.3627, 7.2973, 0.3041, 0.2643, 0.2420, 0.1975], device='cuda:0')
tensor([ 12.2938, 12.0520, 10.2006, 10.0838, 10.0000, 9.8853,
7.4468, 7.2974, 0.3078, 0.2585, 0.2148, 0.1598], device='cuda:0')
tensor([ 12.3265, 12.0520, 10.2277, 10.1063, 10.0001, 9.8853,
7.4629, 7.2974, 0.3733, 0.3435, 0.2603, 0.2503], device='cuda:0')
tensor([ 12.5809, 12.0520, 10.4342, 10.3264, 10.0000, 9.8853,
7.6103, 7.2973, 0.4250, 0.3824, 0.3240, 0.2292], device='cuda:0')
tensor([ 12.1654, 12.0520, 10.0932, 10.0001, 9.9825, 9.8853,
7.3627, 7.2976, 0.4197, 0.3485, 0.3112, 0.2711], device='cuda:0')
tensor([ 12.2557, 12.0521, 10.1630, 10.0525, 10.0000, 9.8854,
7.4115, 7.2973, 0.4303, 0.3432, 0.3237, 0.2445], device='cuda:0')
tensor([ 12.5036, 12.0520, 10.3740, 10.2545, 10.0000, 9.8853,
7.5665, 7.2973, 0.2738, 0.2367, 0.2166, 0.1782], device='cuda:0')
tensor([ 12.0985, 12.0520, 10.0394, 10.0001, 9.9249, 9.8853,
7.3254, 7.2974, 0.3275, 0.2966, 0.2596, 0.2185], device='cuda:0')
tensor([ 12.5316, 12.0520, 10.4002, 10.2913, 10.0000, 9.8853,
7.5864, 7.2974, 0.3936, 0.3172, 0.2886, 0.2097], device='cuda:0')
tensor([ 12.6889, 12.0520, 10.5327, 10.4080, 10.0000, 9.8853,
7.6836, 7.2974, 0.2977, 0.2453, 0.2348, 0.1856], device='cuda:0')
tensor([ 12.1498, 12.0520, 10.0823, 10.0000, 9.9642, 9.8854,
7.3586, 7.2974, 0.4269, 0.3754, 0.3256, 0.2559], device='cuda:0')
tensor([ 12.1555, 12.0520, 10.0841, 10.0001, 9.9664, 9.8853,
7.3581, 7.2974, 0.3186, 0.2729, 0.2538, 0.2028], device='cuda:0')
tensor([ 12.3647, 12.0521, 10.2616, 10.1446, 10.0001, 9.8854,
7.4893, 7.2974, 0.3760, 0.2918, 0.2656, 0.2396], device='cuda:0')
epoch 0[0/556], loss: 26438.254, coord_loss: 26437.879, conf_objloss: 0.074, conf_noobjloss: 0.211 cls_loss: 0.375 (8.85 s/batch, rest:1:21:58)
image_size [640 640]
Traceback (most recent call last):
U, S, Vh = torch.svd(XwX)
RuntimeError: MAGMA gesdd : the updating process of SBDSDC did not converge (error: 11) at /pytorch/aten/src/THC/generic/THCTensorMathMagma.cu:364

Process finished with exit code 1

Yeah, LAPACK can't deal with NaNs, unfortunately. You can try the anomaly detection mode to find where the NaN happens: https://pytorch.org/docs/master/autograd.html#anomaly-detection
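A minimal sketch of turning it on globally (assumed usage; the 0/0 below is only there to show what the report looks like when a backward op produces NaN):

import torch

torch.autograd.set_detect_anomaly(True)

x = torch.zeros(1, requires_grad=True)
y = x / x        # 0/0 puts a NaN into the graph
y.backward()     # anomaly mode should raise here and point at the forward op
                 # (the division) whose backward produced the NaN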

I get the same error:

RuntimeError: MAGMA gesdd : the updating process of SBDSDC did not converge (error: 1) at /pytorch/aten/src/THC/generic/THCTensorMathMagma.cu:364

What can be the reason for NaN while applying SVD? Is there an example you can point to for applying anomaly detection in such a scenario?

Thanks!

I am unable to reproduce the same error in torch using the code below:

import torch

a = torch.tensor([[float('NaN'), 0.0, 0.0, 9.57, -3.49, 9.84],
                  [9.93, 6.91, -7.93, 1.64, 4.02, 0.15],
                  [9.83, 5.04, 4.86, 8.83, 9.80, -8.99],
                  [5.45, -0.27, 4.85, 0.74, 10.00, -6.02],
                  [0.0, 7.98, 3.01, 5.80, 4.27, -5.31]]).t()

u, s, v = torch.svd(a)

The current error is:
Lapack Error gesvd : 4 superdiagonals failed to converge. at /pytorch/aten/src/TH/generic/THTensorLapack.c:470
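One difference worth noting (my reading of the two tracebacks, not something stated above): the original error comes from the CUDA/MAGMA backend (THCTensorMathMagma.cu), while this repro runs through the CPU LAPACK backend (THTensorLapack.c). A hedged sketch of forcing the GPU path, assuming a CUDA build of PyTorch:

import torch

a = torch.randn(6, 5)
a[0, 0] = float('nan')          # inject a NaN, as in the failing input
u, s, v = torch.svd(a.cuda())   # goes through the MAGMA gesdd path; expected to
                                # fail to converge, though the exact message can
                                # vary with the PyTorch version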

@kk1153 and @SimonW, it would be great if you could tell me how to go about reproducing the same error and fixing it.

In my original code, the error appears around the 500th epoch, after 13 hours of training, so I really need to be able to replicate it and fix it.

Thanks,
Shikha

You have a NaN in your SVD input… The reason for the NaN is usually that the output of a previous function has a NaN. The anomaly detection doc has a pretty good example of how to apply it to a module; it should help you find where it happens.
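A minimal sketch of wrapping a module's forward/backward in the context manager (MyLossModule is a hypothetical stand-in for whatever produces your loss):

import torch

class MyLossModule(torch.nn.Module):   # placeholder module
    def forward(self, x):
        U, S, V = torch.svd(x)
        return S.sum()

loss_fn = MyLossModule()
x = torch.randn(12, 12, requires_grad=True)

with torch.autograd.detect_anomaly():
    loss = loss_fn(x)
    loss.backward()   # if any backward op returns NaN, this raises with a
                      # traceback pointing at the forward call responsible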

I would suggest looking at the gradients in the previous step. In my case, the gradients were exploding in the previous training iteration due to a normalization step I was doing after the SVD. This made the input NaN, which causes this error when the NaN goes into torch.svd.

So what you can do is check the values of the gradients of the variables that are outputs of the SVD and see whether they are becoming very large; you can inspect gradients using a backward hook (see the sketch below). Also look for places where you are dividing by a variable that can get close to zero; if there are any, just add an epsilon to the denominator.
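A sketch of both suggestions (XwX, the normalization step, and eps are placeholders for illustration, not the original code):

import torch

eps = 1e-8
XwX = torch.randn(12, 12, requires_grad=True)   # stand-in for the real SVD input
U, S, V = torch.svd(XwX)

# Backward hooks: report how large the gradient reaching each SVD output gets.
for name, t in (("U", U), ("S", S), ("V", V)):
    t.register_hook(lambda grad, name=name: print(name, "grad max:", grad.abs().max().item()))

# Guard divisions that follow the SVD: adding eps keeps a near-zero denominator
# from blowing the gradient up to inf/NaN.
normalized = U / (S.unsqueeze(0) + eps)

normalized.sum().backward()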

@kk1153 @SimonW, thanks for your responses.

I reran my code and it worked this time. For some reason, the jobs had failed on a V100.