Weighted Distance Function

Hi PyTorch,
I’m trying to implement a weighted distance function for my loss. Basically I’m creating a pairwise distance matrix dd of size n x n between my two inputs X (n x 3 x 3) and Y (n x 3 x 3). My distance takes the norm over the final dimension and sums the results, i.e. dd = torch.sum(torch.norm(x - y, 2, -1), -1). The thing is, I want this distance to be weighted, so my idea was to do something like dd = 2 * torch.sum(torch.norm(x - y, 2, -1), -1) + torch.max(torch.norm(x - y, 2, -1), -1)[0] - torch.min(torch.norm(x - y, 2, -1), -1)[0]. In effect this calculates a distance of 3*largest + 2*middle + 1*smallest.
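
For concreteness, here is a small numerical sketch (the three per-pair norm values are just made up) showing that 2*sum + max - min amounts to a 3/2/1 weighting of the sorted norms:

    import torch

    # three hypothetical per-pair norms, already sorted: smallest, middle, largest
    a, b, c = torch.Tensor([1.0, 2.0, 5.0])

    weighted = 2 * (a + b + c) + c - a  # 2*sum + max - min
    explicit = 1 * a + 2 * b + 3 * c    # 1*smallest + 2*middle + 3*largest

    print(weighted, explicit)           # both are 20.0
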
My question is about autograd: when the gradients are calculated, won’t these constants be chain-ruled out, just the same as when you double a loss, or something like that? How can I ensure this isn’t the case, and how can I verify whether it is? Would instantiating a new variable for the max and min values be enough?

The reason I’m trying to do this is that the network seems to be settling into a local minimum where only 2 of the 3 vectors in the final dimension are being reconstructed correctly, so I wanted to give the largest loss more weight. Any help with this question would be greatly appreciated.

Autograd will take your whole loss expression into account when you call .backward() on its output. Yes, this is similar to multiplying your final loss by, say, 2 and then calling .backward() on it. Why would you not want this to be the case?

The easiest way to verify this is to pass in some random values (or ones, or zeros) to your loss function, do a .backward(), and check the gradients of your inputs:

    import torch
    from torch.autograd import Variable

    inputs = Variable(torch.randn(n, 3, 3), requires_grad=True)  # n = batch size
    targets = Variable(torch.ones(n, 3, 3))
    loss(inputs, targets).backward()    # loss = your distance function
    print(inputs.grad)                  # populated if the graph is intact
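
For example, here is a self-contained sketch using the plain norm-sum distance from the original post (random values, arbitrary batch size of 4): scaling the loss by a constant scales the input gradients by that same constant rather than dividing it out.

    import torch
    from torch.autograd import Variable

    x = Variable(torch.randn(4, 3, 3), requires_grad=True)
    y = Variable(torch.randn(4, 3, 3))

    # unweighted distance: sum of the row-wise norms
    d = torch.sum(torch.norm(x - y, 2, -1))
    d.backward()
    g1 = x.grad.data.clone()

    # the same distance multiplied by 2
    x.grad.data.zero_()
    d2 = 2 * torch.sum(torch.norm(x - y, 2, -1))
    d2.backward()
    g2 = x.grad.data.clone()

    print(g2 / g1)  # every entry is 2.0: the constant carries through to the gradients
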

I don’t think max and min are differentiable, though.

Well, my first question was about the backward pass: I want the largest distance to have the most heavily weighted gradient. However, if you multiply that by a constant, won’t it be divided out when calculating the gradient? As for whether max and min are differentiable, it seems to be working as intended, and I’ve tried to verify that the graph is unbroken, so I don’t think it’s necessarily an issue.
Here is my code (it also has a permutation part, since I’m trying to find order-invariant minimums along several axes):

    # x = [n x 3 x 3]
    # y = [n x 3 x 3]
    x, y = x.view(-1, 3, 3), y.view(-1, 3, 3)
    # build all 6 row-permutations of x
    xx = torch.stack([x] * 6, 1)
    for i, p in enumerate(itertools.permutations([0, 1, 2], 3)):
        for j, c in enumerate(p):
            xx[:, i, j] = x[:, c]
    xx = xx.unsqueeze(1).expand(-1, xx.size(0), -1, -1, -1)
    # xx = [n x n x 6 x 3 x 3]
    yy = y.unsqueeze(1).expand(-1, 6, -1, -1)
    zz = yy - xx
    dnorms = torch.norm(zz, 2, -1)
    dmins = torch.min(dnorms, -1)[0]
    dmaxs = torch.max(dnorms, -1)[0]
    # here's the weighting part: 2*sum + max - min = 3*largest + 2*middle + 1*smallest
    dd = torch.sum(dnorms, -1) * 2 + dmaxs - dmins
    # dd[i,j,k] = weighted distance from x[i] to y[j] under permutation k
    # take the minimum over all permutations
    pd, _ = torch.min(dd, -1)
    # for each y[j], the minimum distance to any x[i]
    xmin, _ = torch.min(pd, 0)
    # for each x[i], the minimum distance to any y[j]
    ymin, _ = torch.min(pd, 1)
    return torch.mean(xmin) + torch.mean(ymin)
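
One way to exercise the whole thing end to end (a sketch; weighted_min_distance is just a placeholder name for a function wrapping the body above, and torch/itertools are assumed to be imported):

    import torch
    from torch.autograd import Variable

    n = 5
    x = Variable(torch.randn(n, 3, 3), requires_grad=True)
    y = Variable(torch.randn(n, 3, 3))

    # weighted_min_distance is a placeholder for the function whose body is shown above
    out = weighted_min_distance(x, y)
    out.backward()

    # if the graph is unbroken, x.grad is populated and non-zero
    print(x.grad)
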

If you multiply it by a constant, the gradient computation will include that constant. If you have f(x) and you multiply it by 2, i.e. 2 * f(x), then differentiating with respect to x gives 2 * f'(x). It’s the same thing here.

If you weight the distances accordingly and then sum them together, the distance with the largest weight will (probably) contribute the most to the gradient, depending on how the rest of your functions look.
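
As a tiny illustration (a sketch with made-up stand-in values for the three row norms): weighting the terms of a sum weights their gradients by the same factors, so the most heavily weighted distance pushes the hardest.

    import torch
    from torch.autograd import Variable

    # stand-ins for the three per-pair row norms: smallest, middle, largest
    r = Variable(torch.Tensor([1.0, 2.0, 5.0]), requires_grad=True)

    d = 1 * r[0] + 2 * r[1] + 3 * r[2]  # 1*smallest + 2*middle + 3*largest
    d.backward()

    print(r.grad)  # [1, 2, 3]: each term's gradient carries its weight
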

Ok thanks. I’ve wondered if it’s worth my time to implement this as its own autograd function, for my sanity, but I must have rewritten this a thousand times already.