"Training" variables to do SVD

In a bid to get familiar with PyTorch syntax, I thought I’d try to use gradient descent to do SVD - not just the standard SVD routine on its own, but multidimensional scaling (MDS), which requires an SVD.

Essentially, I generated a random n x n matrix U, a random diagonal n x n matrix s, and a random n x n matrix Vh, just as a starting point. The goal is for U s Vh.T to approximate some matrix B the way its SVD would, i.e. U s Vh.T ≈ B.
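For reference, the closed-form version this loop is trying to imitate is a one-liner with torch.linalg.svd; a minimal sketch (using a random stand-in for B, and mirroring the embedding formula from the code below):

import torch

N = 15
B = torch.rand(N, N)                                # stand-in for the double-centered matrix below
U, S, Vh = torch.linalg.svd(B)                      # closed-form SVD: B ≈ U @ diag(S) @ Vh
recon_err = torch.norm(U @ torch.diag(S) @ Vh - B)  # ~0, up to float error
X_hat = (torch.diag(torch.sqrt(S)) @ Vh).T[:, :3]   # first 3 coordinates, as in the loop below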

I guess I’m running into two rookie pitfalls: (1) the loss does not update after the first iteration (why?), and (2) is it possible to “combine” two loss functions? Below, I “combine” the loss of the SVD approximation with the loss of the MDS approximation:

# load into pytorch
D = torch.rand(N, N).float()
H = torch.eye(N).float() - \
    1/N * torch.ones(D.shape).float()
B = 0.5 * torch.matmul(torch.matmul(H, D**2), H)
pts = torch.rand(N, 3).float()

# declare constants
stop_loss = 1e-2
step_size = stop_loss / 3

# emulate SVD with trainable factors
U = torch.autograd.Variable(torch.rand(N, N), requires_grad = True)
s = torch.autograd.Variable(torch.diag(torch.sort(torch.rand(N), descending = True).values), requires_grad = True)
Vh = torch.autograd.Variable(torch.rand(N, N), requires_grad = True)

# find embedding
embed = torch.matmul(torch.sqrt(s), Vh)
X_hat = embed.T[:, :3] # select the first 3 coordinates --> x, y, z

for i in range(100000):

    # calculate loss1: how close is our SVD function?
    delta1 = (torch.matmul(U.T, U) - torch.eye(N)) + \
            (torch.matmul(Vh.T, Vh) - torch.eye(N)) + \
            (torch.matmul(torch.matmul(U, s), Vh.T) - B)

    L1 = torch.norm(delta1, p=2)

    # calculate loss2: how close is our MDS approximation?
    delta2 = torch.nan_to_num(X_hat - pts, nan=0.0)    

    L2 = torch.norm(delta2, p=2)

    # Backprop
    loss = L1 + L2
    loss.backward()
        
    # update
    U.data -= step_size * U.grad.data
    s.data -= step_size * s.grad.data
    Vh.data -= step_size * Vh.grad.data
    
    U.data.data.zero_()
    s.data.data.zero_()
    Vh.data.data.zero_()
    
    if i % 1000 == 0: 
        print('Loss is %s at iteration %i' % (loss, i))
    
    if abs(loss) < stop_loss:
        break

So, some hints.
Variable was deprecated a couple of years ago :sweat_smile:
You don’t need to access .data; that just performs in-place modifications of the tensor.
You can define an optimizer and pass it the tensors:

Some corrections would be

import torch
N=15
# load into pytorch
D = torch.rand(N, N).float()
H = torch.eye(N).float() - \
    1 / N * torch.ones(D.shape).float()
B = 0.5 * torch.matmul(torch.matmul(H, D ** 2), H)
pts = torch.rand(N, 3).float()

# declare constants
stop_loss = 1e-2
step_size = stop_loss / 3

# emulate SVD with trainable factors
U = torch.rand(N, N, requires_grad=True)
s = torch.diag(torch.sort(torch.rand(N), descending=True).values).requires_grad_()
Vh = torch.rand(N, N, requires_grad=True)

optimizer = torch.optim.SGD([U,s,Vh], lr=step_size)
# find embedding
embed = torch.matmul(torch.sqrt(s), Vh)
X_hat = embed.T[:, :3]  # select the first 3 coordinates --> x, y, z

for i in range(100000):

    # calculate loss1: how close is our SVD function?
    delta1 = (torch.matmul(U.T, U) - torch.eye(N)) + \
             (torch.matmul(Vh.T, Vh) - torch.eye(N)) + \
             (torch.matmul(torch.matmul(U, s), Vh.T) - B)

    L1 = torch.norm(delta1, p=2)

    # calculate loss2: how close is our MDS approximation?
    delta2 = torch.nan_to_num(X_hat - pts, nan=0.0)

    L2 = torch.norm(delta2, p=2)

    # Backprop
    loss = L1 + L2
    loss.backward()

    # update
    optimizer.zero_grad()
    optimizer.step()

    if i % 1000 == 0:
        print('Loss is %s at iteration %i' % (loss, i))

    if abs(loss) < stop_loss:
        break

If you want to do your own optimization you should zero the gradients each step.
Lastly, I don’t know whether torch.nan_to_num(X_hat - pts, nan=0.0) is backpropagable or not, but I can’t imagine why you would have NaNs.
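A quick way to check, just a sketch (I haven’t run it here): push a tiny tensor containing a NaN through it, call backward, and look at the gradient:

x = torch.tensor([1.0, float('nan'), 3.0], requires_grad=True)
y = torch.nan_to_num(x, nan=0.0).sum()   # forward pass is finite
y.backward()
print(x.grad)                            # inspect which entries actually receive a gradient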

Just to answer your questions:

    U.data.data.zero_()
    s.data.data.zero_()
    Vh.data.data.zero_()

This zeroes the values of U, s and Vh on every iteration, so the result stays static because you are “resetting” the tensors. And yes, you can combine them as L = L1 + L2.
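For reference, if you do want a manual update without an optimizer, the intended pattern is presumably something like this sketch: zero the gradients, not the values.

with torch.no_grad():        # don't track the parameter update itself
    U -= step_size * U.grad
    s -= step_size * s.grad
    Vh -= step_size * Vh.grad
U.grad.zero_()               # reset gradients for the next iteration
s.grad.zero_()
Vh.grad.zero_()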


I see! Thank you so much for pointing this out!

Even when I declare an optimizer and step through it as in your suggestion above, the loss does not change. Is there a common step I am forgetting here?

import torch
torch.autograd.set_detect_anomaly(True)

N=15
# load into pytorch
D = torch.rand(N, N).float()
H = torch.eye(N).float() - \
    1 / N * torch.ones(D.shape).float()
B = 0.5 * torch.matmul(torch.matmul(H, D ** 2), H)
# pts = torch.rand(N, 3).float()

# declare constants
stop_loss = 1e-2
step_size = stop_loss / 3

# emulate SVD with trainable factors
U = torch.rand(N, N, requires_grad=True)
s = torch.diag(torch.sort(torch.rand(N), descending=True).values).requires_grad_()
Vh = torch.rand(N, N, requires_grad=True)

optimizer = torch.optim.SGD([U,s,Vh], lr=step_size)
# find embedding
# embed = torch.matmul(torch.sqrt(s), Vh)
# X_hat = embed.T[:, :3]  # select the first 3 coordinates --> x, y, z

for i in range(100000):

    # calculate loss1: how close is our SVD function?
    delta1 = (torch.matmul(U.T, U) - torch.eye(N)) + \
             (torch.matmul(Vh.T, Vh) - torch.eye(N)) + \
             (torch.matmul(torch.matmul(U, s), Vh.T) - B)

    L1 = torch.norm(delta1, p=2)

    # calculate loss2: how close is our MDS approximation?
    # # delta2 = torch.nan_to_num(X_hat - pts, nan=0.0)

    # # L2 = torch.norm(delta2, p=2)

    # Backprop
    loss = L1
    loss.backward()

    # update
    optimizer.zero_grad()
    optimizer.step()

    if i % 1000 == 0:
        print('Loss is %s at iteration %i' % (loss, i))

    if abs(loss) < stop_loss:
        break

Output:

Loss is tensor(141.6570, grad_fn=<NormBackward1>) at iteration 0
Loss is tensor(141.6570, grad_fn=<NormBackward1>) at iteration 1000
Loss is tensor(141.6570, grad_fn=<NormBackward1>) at iteration 2000
Loss is tensor(141.6570, grad_fn=<NormBackward1>) at iteration 3000
Loss is tensor(141.6570, grad_fn=<NormBackward1>) at iteration 4000
...
...
etc.

I see your point. Sometimes my numpy array pts has missing rows; my toy exercise is to try to recover those points. It looks something like this:

pts 
>>>array([[1327.769     ,  922.555     ,   86.56067961],
       [          nan,           nan,           nan],
       [          nan,           nan,           nan],
       [          nan,           nan,           nan],
       [          nan,           nan,           nan],
       [1327.34660274,  921.29142466,   78.19481833],
       [          nan,           nan,           nan],
       [          nan,           nan,           nan],
       [          nan,           nan,           nan],
       [          nan,           nan,           nan],
       [          nan,           nan,           nan],
       [1316.07260274,  920.69542466,   85.07831347]])


I’m trying to set the loss on those missing rows to zero (i.e. torch.nan_to_num(X_hat - pts, nan=0.0)), while taking the real L2 norm over the non-missing values. Is there a function that lets PyTorch treat NaN values in the loss as 0?
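Something like the masking below is roughly what I have in mind (just a sketch, untested, and assuming pts has already been converted to a tensor): zero out the missing entries before the norm, so no NaN ever enters the graph.

mask = torch.isfinite(pts).float()                # 1.0 where a coordinate is observed, 0.0 where it is NaN
delta2 = (X_hat - torch.nan_to_num(pts)) * mask   # missing entries contribute exactly 0 to the loss
L2 = torch.norm(delta2, p=2)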

So as long as the gradient is not NaN or None, I guess it works for any value except those.
Did you manage to get the tensors updated?

Doesn’t seem like it. Two comments above, I commented out the second loss function and only calculated the first to avoid this problem. But still, the loss remains constant!

Sorry, my bad :confused:
step() should be called before zeroing the grad - I had them the wrong way round in my snippet :sweat_smile:

    optimizer.zero_grad()
    optimizer.step()
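So within each iteration the order should be (a minimal sketch of the loop body):

    loss.backward()         # accumulate fresh gradients
    optimizer.step()        # apply the update using those gradients
    optimizer.zero_grad()   # only now clear them for the next iteration

Here is the corrected script: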
import torch
import matplotlib.pyplot as plt
torch.autograd.set_detect_anomaly(True)

N=5
# load into pytorch
D = torch.rand(N, N).float()
H = torch.eye(N).float() - \
    1 / N * torch.ones(D.shape).float()
B = 0.5 * torch.matmul(torch.matmul(H, D ** 2), H)
pts = torch.rand(N, 3).float()

# declare constants
stop_loss = 1e-2
step_size = stop_loss / 3

# emulate SVD with trainable factors
U = torch.rand(N, N, requires_grad=True)
s = torch.diag(torch.sort(torch.rand(N), descending=True).values).requires_grad_()
Vh = torch.rand(N, N, requires_grad=True)

optimizer = torch.optim.SGD([U,s,Vh], lr=step_size)
# find embedding
embed = torch.matmul(torch.sqrt(s), Vh)
print(embed)
X_hat = embed.T[:, :3]  # select the first 3 coordinates --> x, y, z
loss_hist = []
for i in range(10):
    print(f'Iteration {i}:')
    print(f'\t U_s:{U.sum()}, s:{s.sum()}, Vh_s:{Vh.sum()}')
    # calculate loss1: how close is our SVD function?
    delta1 = (torch.matmul(U.T, U) - torch.eye(N)) + \
             (torch.matmul(Vh.T, Vh) - torch.eye(N)) + \
             (torch.matmul(torch.matmul(U, s), Vh.T) - B)

    L1 = torch.norm(delta1, p=2)

    # calculate loss2: how close is our MDS approximation?
    delta2 = X_hat - pts
    # 
    L2 = torch.norm(delta2, p=2)
    print(f'\t L1: {L1}, L2: {L2}')
    # Backprop
    loss = L1
    loss.backward()
    loss_hist.append(loss.item())
    # update: step first, then zero the gradients
    optimizer.step()
    optimizer.zero_grad()

Still, if I add L2 there are NaNs, but I guess that’s about the maths (which I didn’t check).
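A guess at where those NaNs come from (not verified): the diagonal of s can go negative after a few updates, and torch.sqrt of a negative entry is NaN; also, embed and X_hat are built once outside the loop, so they never see the updated s and Vh. Recomputing the embedding inside the loop, with a small clamp before the sqrt, would look roughly like this:

    # inside the loop: rebuild the embedding from the current s and Vh
    embed = torch.matmul(torch.sqrt(torch.clamp(s, min=1e-8)), Vh)  # clamp keeps sqrt finite and differentiable
    X_hat = embed.T[:, :3]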


I see…! The order matters a lot. Thank you for this feedback, I learned a lot.
