Trying to backpropagate through svd

I get the following error after one forward pass through a network whose loss function is computed using torch.svd:

MAGMA gesdd : the updating process of SBDSDC did not converge

Does anyone have any idea what might be wrong?

Your matrix is probably so ill-conditioned that the underlying LAPACK operation couldn't converge. You can try catching that error, printing the input when it happens, and seeing what the matrix looks like exactly.
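A rough sketch of what that could look like (safe_svd is just a placeholder name, not from the original code):

import torch

def safe_svd(mat):
    try:
        return torch.svd(mat)
    except RuntimeError as e:  # MAGMA/LAPACK convergence failures surface as RuntimeError
        print("svd failed:", e)
        print("any NaN in input:", torch.isnan(mat).any().item())
        print("input matrix:")
        print(mat)
        raise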

Thanks. Actually, when I inspect the input to torch.svd in a try/except block, it has a lot of NaN values. I am unable to figure out what made the output of the network become NaN.

Can very similar singular values make the gradient through SVD NaN?
Below are the printed singular values for a single batch.

tensor([ 12.1675, 12.0520, 10.0973, 10.0002, 9.9813, 9.8853,
7.3679, 7.2976, 0.3496, 0.2729, 0.2605, 0.2010], device='cuda:0')
tensor([ 12.1267, 12.0520, 10.0581, 10.0000, 9.9442, 9.8853,
7.3371, 7.2973, 0.3157, 0.2456, 0.2269, 0.1650], device='cuda:0')
tensor([ 12.2977, 12.0520, 10.1992, 10.0880, 10.0000, 9.8853,
7.4365, 7.2974, 0.3493, 0.2985, 0.2740, 0.2152], device='cuda:0')
tensor([ 12.1671, 12.0520, 10.0945, 10.0000, 9.9820, 9.8853,
7.3627, 7.2973, 0.3041, 0.2643, 0.2420, 0.1975], device='cuda:0')
tensor([ 12.2938, 12.0520, 10.2006, 10.0838, 10.0000, 9.8853,
7.4468, 7.2974, 0.3078, 0.2585, 0.2148, 0.1598], device='cuda:0')
tensor([ 12.3265, 12.0520, 10.2277, 10.1063, 10.0001, 9.8853,
7.4629, 7.2974, 0.3733, 0.3435, 0.2603, 0.2503], device='cuda:0')
tensor([ 12.5809, 12.0520, 10.4342, 10.3264, 10.0000, 9.8853,
7.6103, 7.2973, 0.4250, 0.3824, 0.3240, 0.2292], device='cuda:0')
tensor([ 12.1654, 12.0520, 10.0932, 10.0001, 9.9825, 9.8853,
7.3627, 7.2976, 0.4197, 0.3485, 0.3112, 0.2711], device='cuda:0')
tensor([ 12.2557, 12.0521, 10.1630, 10.0525, 10.0000, 9.8854,
7.4115, 7.2973, 0.4303, 0.3432, 0.3237, 0.2445], device='cuda:0')
tensor([ 12.5036, 12.0520, 10.3740, 10.2545, 10.0000, 9.8853,
7.5665, 7.2973, 0.2738, 0.2367, 0.2166, 0.1782], device='cuda:0')
tensor([ 12.0985, 12.0520, 10.0394, 10.0001, 9.9249, 9.8853,
7.3254, 7.2974, 0.3275, 0.2966, 0.2596, 0.2185], device='cuda:0')
tensor([ 12.5316, 12.0520, 10.4002, 10.2913, 10.0000, 9.8853,
7.5864, 7.2974, 0.3936, 0.3172, 0.2886, 0.2097], device='cuda:0')
tensor([ 12.6889, 12.0520, 10.5327, 10.4080, 10.0000, 9.8853,
7.6836, 7.2974, 0.2977, 0.2453, 0.2348, 0.1856], device='cuda:0')
tensor([ 12.1498, 12.0520, 10.0823, 10.0000, 9.9642, 9.8854,
7.3586, 7.2974, 0.4269, 0.3754, 0.3256, 0.2559], device='cuda:0')
tensor([ 12.1555, 12.0520, 10.0841, 10.0001, 9.9664, 9.8853,
7.3581, 7.2974, 0.3186, 0.2729, 0.2538, 0.2028], device='cuda:0')
tensor([ 12.3647, 12.0521, 10.2616, 10.1446, 10.0001, 9.8854,
7.4893, 7.2974, 0.3760, 0.2918, 0.2656, 0.2396], device='cuda:0')
epoch 0[0/556], loss: 26438.254, coord_loss: 26437.879, conf_objloss: 0.074, conf_noobjloss: 0.211 cls_loss: 0.375 (8.85 s/batch, rest:1:21:58)
image_size [640 640]
Traceback (most recent call last):
U, S, Vh = torch.svd(XwX)
RuntimeError: MAGMA gesdd : the updating process of SBDSDC did not converge (error: 11) at /pytorch/aten/src/THC/generic/THCTensorMathMagma.cu:364

Process finished with exit code 1

Yeah, LAPACK can't deal with NaNs, unfortunately. You can try the anomaly detection mode to find where the NaN happens: https://pytorch.org/docs/master/autograd.html#anomaly-detection
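A minimal sketch of turning it on globally (assumed usage; the 0/0 below is only there to show what the report looks like when a backward op produces NaN):

import torch

torch.autograd.set_detect_anomaly(True)

x = torch.zeros(1, requires_grad=True)
y = x / x        # 0/0 puts a NaN into the graph
y.backward()     # anomaly mode should raise here and point at the forward op
                 # (the division) whose backward produced the NaN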

I get the same error:

RuntimeError: MAGMA gesdd : the updating process of SBDSDC did not converge (error: 1) at /pytorch/aten/src/THC/generic/THCTensorMathMagma.cu:364

What can be the reason for NaN while applying SVD? Is there an example you can point to for applying anomaly detection in such a scenario?

Thanks!

I am unable to reproduce the same error in torch using the code below:

import torch

a = torch.tensor([[float('NaN'), 0.0, 0.0, 9.57, -3.49, 9.84],
                  [9.93, 6.91, -7.93, 1.64, 4.02, 0.15],
                  [9.83, 5.04, 4.86, 8.83, 9.80, -8.99],
                  [5.45, -0.27, 4.85, 0.74, 10.00, -6.02],
                  [0.0, 7.98, 3.01, 5.80, 4.27, -5.31]]).t()

u, s, v = torch.svd(a)

The current error is:
Lapack Error gesvd : 4 superdiagonals failed to converge. at /pytorch/aten/src/TH/generic/THTensorLapack.c:470
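One difference worth noting (my reading of the two tracebacks, not something stated above): the original error comes from the CUDA/MAGMA backend (THCTensorMathMagma.cu), while this repro runs through the CPU LAPACK backend (THTensorLapack.c). A hedged sketch of forcing the GPU path, assuming a CUDA build of PyTorch:

import torch

a = torch.randn(6, 5)
a[0, 0] = float('nan')          # inject a NaN, as in the failing input
u, s, v = torch.svd(a.cuda())   # goes through the MAGMA gesdd path; expected to
                                # fail to converge, though the exact message can
                                # vary with the PyTorch version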

@kk1153 and @SimonW, it would be great if you could tell me how to go about reproducing the same error and fixing it.

In my original code, the error appears around the 500th epoch, after 13 hours of training, so I really need to be able to replicate it and fix it.

Thanks,
Shikha

You have a NaN in your SVD input… The reason for the NaN is usually that the output of a previous function has a NaN. The anomaly detection doc has a pretty good example of how to apply it to a module; it should help you find where it happens.
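A minimal sketch of wrapping a module's forward/backward in the context manager (MyLossModule is a hypothetical stand-in for whatever produces your loss):

import torch

class MyLossModule(torch.nn.Module):   # placeholder module
    def forward(self, x):
        U, S, V = torch.svd(x)
        return S.sum()

loss_fn = MyLossModule()
x = torch.randn(12, 12, requires_grad=True)

with torch.autograd.detect_anomaly():
    loss = loss_fn(x)
    loss.backward()   # if any backward op returns NaN, this raises with a
                      # traceback pointing at the forward call responsible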

I would suggest looking at the gradients in the previous step. In my case, the gradients were exploding in the previous training iteration due to a normalization step I was doing after the SVD. This made the input NaN, which causes this error when the NaN goes into torch.svd.

So what you can do is check the values of the gradients of the variables that are outputs of the SVD and see whether they are becoming very large; you can inspect gradients using a backward hook (see the sketch below). Also look for places where you are dividing by a variable that can get close to zero; if there are any, just add an epsilon to the denominator.
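A sketch of both suggestions (XwX, the normalization step, and eps are placeholders for illustration, not the original code):

import torch

eps = 1e-8
XwX = torch.randn(12, 12, requires_grad=True)   # stand-in for the real SVD input
U, S, V = torch.svd(XwX)

# Backward hooks: report how large the gradient reaching each SVD output gets.
for name, t in (("U", U), ("S", S), ("V", V)):
    t.register_hook(lambda grad, name=name: print(name, "grad max:", grad.abs().max().item()))

# Guard divisions that follow the SVD: adding eps keeps a near-zero denominator
# from blowing the gradient up to inf/NaN.
normalized = U / (S.unsqueeze(0) + eps)

normalized.sum().backward()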

@kk1153 @SimonW, thanks for your responses.

I reran my code and it worked this time. For some reason, the jobs had failed on a V100.