Weird! Two implementations of the same custom loss idea give different results

Hello there.
I wanted to define a loss function for a NN regression problem. When the prediction and the true value are both positive or both negative, the loss is the normal MSE. However, when they have opposite signs, the loss is 100 * the normal MSE, to penalize that situation.
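To state the rule concretely, here is a minimal sketch of the loss I have in mind (purely illustrative, written with torch.where and assuming a reasonably recent PyTorch; it is not either of the two implementations discussed below):

import torch

def sign_penalized_mse(pred, target, penalty=100.0):
    # element-wise squared error
    sq_err = (pred - target) ** 2
    # weight 100 wherever prediction and target have different signs, 1 otherwise
    weight = torch.where(pred * target < 0,
                         torch.full_like(sq_err, penalty),
                         torch.ones_like(sq_err))
    return (weight * sq_err).mean()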

The weird thing is that I implemented it in two ways, but the two give different results.

The following is the first one.
The training data is stored in df['vector'] as tuples of the form
([features], [target value])
Here, I use 7 features to regress against the target value.
For every element of df['vector'], the appropriate loss is computed according to the sign situation, and the loss for an epoch is the accumulated per-sample loss (divided by the number of samples before the backward pass).
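For example, each element of df['vector'] looks like this (a toy illustration with made-up numbers, only to show the layout):

import pandas as pd

df = pd.DataFrame({'vector': [
    ([0.1, 0.5, -0.2, 1.0, 0.3, 0.0, 0.7], [2.3]),    # ([7 features], [target value])
    ([0.4, -0.1, 0.8, 0.2, -0.5, 1.1, 0.6], [-1.2]),
]})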

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable


class NN(nn.Module):
    def __init__(self):
        super(NN, self).__init__()
        # 7 input features -> two hidden layers of 100 units -> 1 output
        self.model = nn.Sequential(
            nn.Linear(7, 100),
            nn.ReLU(),
            nn.Linear(100, 100),
            nn.ReLU(),
            nn.Linear(100, 1),
        )

    def forward(self, x):
        out = self.model(x)
        return out
        
class ContrastiveLoss(torch.nn.Module):
    def __init__(self, margin=2.0):
        super(ContrastiveLoss, self).__init__()
        self.margin = margin

    def forward(self, output1, output2):
        try:
            # same sign: plain MSE; opposite sign: 100 * MSE
            if output1.data[0] / output2.data[0] > 0:
                loss_contrastive = nn.MSELoss()(output1, output2)
            else:
                loss_contrastive = 100 * nn.MSELoss()(output1, output2)
        except Exception:
            loss_contrastive = nn.MSELoss()(output1, output2)

        return loss_contrastive
Model = NN()
counter = []
loss = 0
loss_history = []
iteration_number = 0
criterion = ContrastiveLoss()
optimizer = optim.Adam(Model.parameters(), lr=0.0001)
num_epochs = 100

for epoch in range(num_epochs):
    for x in df['vector']:
        x_train = torch.from_numpy(np.array(x[0])).float()
        y_train = torch.from_numpy(np.array(x[1])).float()
        inputs = Variable(x_train)
        target = Variable(y_train)
        # forward: accumulate the per-sample loss over the whole epoch
        out = Model(inputs)
        loss = criterion(out, target) + loss

    loss = loss / len(df)
    # backward
    optimizer.zero_grad()
    loss.backward(retain_graph=True)
    optimizer.step()

    if (epoch + 1) % 1 == 0:
        print('Epoch[{}/{}], loss: {:.6f}'.format(epoch + 1,
                                                   num_epochs,
                                                   loss.data[0]))
        iteration_number += 10
        counter.append(iteration_number)
        loss_history.append(loss.data[0])

It works, just far too slowly.
The result:

Epoch[1/100], loss: 1221.593872
Epoch[2/100], loss: 1169.508179
Epoch[3/100], loss: 1117.318359
Epoch[4/100], loss: 1065.926147
Epoch[5/100], loss: 1020.072205
Epoch[6/100], loss: 976.024841
Epoch[7/100], loss: 935.973450
Epoch[8/100], loss: 906.220398
Epoch[9/100], loss: 881.059326
Epoch[10/100], loss: 868.799316
Epoch[11/100], loss: 853.845886
Epoch[12/100], loss: 852.796265
Epoch[13/100], loss: 843.323669
Epoch[14/100], loss: 847.825928
Epoch[15/100], loss: 852.958923
Epoch[16/100], loss: 843.355103
Epoch[17/100], loss: 834.775635
Epoch[18/100], loss: 819.528076
Epoch[19/100], loss: 812.769348
Epoch[20/100], loss: 808.588867
Epoch[21/100], loss: 775.685608

Therefore, I tried to implement another version that avoids the inefficient loop over the training data and computes the loss in a vectorized way.

In the second way, I separate the features and target values into x_train and y_train respectively. I keep the same NN structure and only change the ContrastiveLoss function. As in the first way, when the prediction and the true value have opposite signs, the loss should be 100 * the normal MSE. Here I achieve that by modifying the underlying data of the tensors: wherever the signs disagree, I multiply both the prediction and the target by 10, so the squaring inside nn.MSELoss produces the factor of 100.
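A quick sanity check of that arithmetic (toy numbers, separate from the training code, assuming a PyTorch version where plain tensors can be passed to nn.MSELoss directly): scaling both the prediction and the target by 10 multiplies the squared error by 100.

import torch
import torch.nn as nn

a = torch.Tensor([2.0])    # prediction
b = torch.Tensor([-1.0])   # target with the opposite sign
print(nn.MSELoss()(a, b))            # (2 - (-1))^2 = 9
print(nn.MSELoss()(10 * a, 10 * b))  # (20 - (-10))^2 = 900

And here is the second implementation itself: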

x_train = torch.from_numpy(features).float()        # features: NumPy array of shape (N, 7)
y_train = torch.from_numpy(target_values).float()   # target_values: NumPy array of targets

        
class ContrastiveLoss(torch.nn.Module):
    def __init__(self, margin=2.0):
        super(ContrastiveLoss, self).__init__()

    def forward(self, output1, output2):
        # the element-wise product is negative wherever prediction and target disagree in sign
        label = output1.data * output2.data
        label = label.numpy()
        wrong_index = np.where(label < 0)
        # scale both sides by 10 at those positions, so the MSE gains a factor of 100 there
        t = output1.data.numpy()
        t[wrong_index] = t[wrong_index] * 10
        s = output2.data.numpy()
        s[wrong_index] = s[wrong_index] * 10
        output1.data = torch.from_numpy(t)
        output2.data = torch.from_numpy(s)
        loss_contrastive = nn.MSELoss()(output1, output2)
        print(loss_contrastive)
        return loss_contrastive
newmodel = NN()
counter = []
loss_history = []
iteration_number = 0
criterion = ContrastiveLoss()
optimizer = optim.Adam(newmodel.parameters(), lr=0.0001)
num_epochs = 100

for epoch in range(num_epochs):
    inputs = Variable(x_train)
    target = Variable(y_train)
    # forward
    out = newmodel(inputs)
    loss = criterion(out, target)
    # backward
    loss.backward()
    optimizer.zero_grad()
    optimizer.step()

    if (epoch + 1) % 1 == 0:
        print('Epoch[{}/{}], loss: {:.6f}'.format(epoch + 1,
                                                   num_epochs,
                                                   loss.data[0]))
        iteration_number += 10
        counter.append(iteration_number)
        loss_history.append(loss.data[0])

The second way is much more efficient, but the loss becomes inf after several epochs.
The result:

Epoch[1.0/100], loss: 1057.704468
Epoch[2.0/100], loss: 72497.578125
Epoch[3.0/100], loss: 7008454.000000
Epoch[4.0/100], loss: 698522432.000000
Epoch[5.0/100], loss: 69829033984.000000
Epoch[6.0/100], loss: 6982675202048.000000
Epoch[7.0/100], loss: 698265716654080.000000
Epoch[8.0/100], loss: 69826543211249664.000000
Epoch[9.0/100], loss: 6982653651110068224.000000
Epoch[10.0/100], loss: 698264052294123257856.000000
Epoch[11.0/100], loss: 69826506278928964911104.000000
Epoch[12.0/100], loss: 6982656986975570338250752.000000
Epoch[13.0/100], loss: 698266743244440207628435456.000000
Epoch[14.0/100], loss: 69826322803288952153627951104.000000
Epoch[15.0/100], loss: 6982660633417258364712658141184.000000
Epoch[16.0/100], loss: 698264489320308698224080346677248.000000
Epoch[17.0/100], loss: inf
Epoch[18.0/100], loss: inf
Epoch[19.0/100], loss: inf
Epoch[20.0/100], loss: inf
Epoch[21.0/100], loss: inf
Epoch[22.0/100], loss: inf
Epoch[23.0/100], loss: inf
Epoch[24.0/100], loss: inf
Epoch[25.0/100], loss: inf
Epoch[26.0/100], loss: inf
Epoch[27.0/100], loss: inf
Epoch[28.0/100], loss: inf
Epoch[29.0/100], loss: inf
Epoch[30.0/100], loss: inf
Epoch[31.0/100], loss: inf
Epoch[32.0/100], loss: inf
Epoch[33.0/100], loss: inf
Epoch[34.0/100], loss: inf
Epoch[35.0/100], loss: inf
Epoch[36.0/100], loss: inf
Epoch[37.0/100], loss: inf
Epoch[38.0/100], loss: inf
Epoch[39.0/100], loss: inf
Epoch[40.0/100], loss: inf
Epoch[41.0/100], loss: inf
Epoch[42.0/100], loss: inf
Epoch[43.0/100], loss: inf
Epoch[44.0/100], loss: inf
Epoch[45.0/100], loss: inf

I am confused by the different results of these two methods. I have been thinking about it for a whole day but cannot figure it out. As a newcomer to PyTorch, I really need your help.

Thanks in advance!