Using tensor operations in the forward function

I have two dense linear nets (model1 and model2), each with 30 output neurons. With the following forward function the loss does not decrease, but when I switch to a single net with a single output neuron and a hardtanh activation, the loss starts decreasing. Is there anything wrong with my forward function?

def forward(self, x1, x2):
    tens1 = self.model1(x1)
    tens2 = self.model2(x2)
    # per-sample Euclidean distance between the two 30-d embeddings
    distance = torch.sqrt(torch.sum((tens1 - tens2) ** 2, 1))
    # clamp the distance into [0.001, 1]
    return nn.Hardtanh(0.001, 1)(distance)

Does the way I have defined my forward function cause any issues with gradient backpropagation?
Thanks

P.S. I was able to narrow the problem down to the numerical calculations, because I get the same issue with

distance = torch.sqrt(torch.sum(tens1 ** 2, 1))
return nn.Hardtanh(0.001, 1)(distance)

or with distance = torch.sqrt(torch.sum(torch.abs(tens1), 1))

With the squared version I could get it to work by adjusting the learning rate, but with the abs version the loss doesn't decrease no matter what learning rate I choose!
Can anyone tell me what I am doing wrong please?


I think I figured it out myself. It is all about the range of the values and the learning rate; playing with those parameters and rescaling the values fixes the issue.
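
One concrete way the value range bites here (my understanding, with made-up numbers): Hardtanh has zero gradient outside its [min_val, max_val] window, so if the raw distances start out above 1, nothing propagates back into the nets:

import torch
import torch.nn as nn

# made-up distances; only the in-range value receives a gradient
distance = torch.tensor([0.5, 2.0, 5.0], requires_grad=True)
out = nn.Hardtanh(0.001, 1)(distance)
out.sum().backward()
print(out.detach())    # tensor([0.5000, 1.0000, 1.0000])
print(distance.grad)   # tensor([1., 0., 0.])

So rescaling the values (or the clamp range) and the learning rate brings the distances back into a regime where gradients flow.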

Also, be careful when using sqrt on numbers that can be 0: sqrt returns nan gradients at 0. You might want to add an epsilon to avoid this if your model ever converges to a point where the distance is 0.
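
A quick sketch of both the failure and the epsilon fix (the 1e-8 is an arbitrary choice):

import torch

# the derivative of sqrt at 0 is infinite; chained with the zero
# gradient of x ** 2 it produces nan
x = torch.zeros(3, requires_grad=True)
torch.sqrt((x ** 2).sum()).backward()
print(x.grad)  # tensor([nan, nan, nan])

# adding a small epsilon inside the sqrt keeps the gradient finite
y = torch.zeros(3, requires_grad=True)
torch.sqrt((y ** 2).sum() + 1e-8).backward()
print(y.grad)  # tensor([0., 0., 0.])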