Different loss values at different executions

I have been trying to carry out regression on the data in my hdf file (attached below) to predict TWO outputs, but I have been getting different loss values on different executions. In most executions the model predicts 0 for one or even both of the outputs.

The true outputs (targets) are the following:

label_batch  tensor([[1.4000, 1.6000],
        [1.4000, 1.6000],
        [1.4000, 1.6000],
        [1.4000, 1.6000],
        [1.4000, 1.6000],
        [1.4000, 1.6000],
        [1.4000, 1.6000],
        [1.4000, 1.6000],
        [1.4000, 1.6000]])

Output for, say, Run 1, which gives the values closest to the targets:

Epoch: 1/10..  Training Loss: 0.9370870..  Test Loss: 0.0499674.. 
Epoch: 2/10..  Training Loss: 0.0297776..  Test Loss: 0.0184589.. 
Epoch: 3/10..  Training Loss: 0.0117876..  Test Loss: 0.0087056.. 
Epoch: 4/10..  Training Loss: 0.0054326..  Test Loss: 0.0034268.. 
Epoch: 5/10..  Training Loss: 0.0020547..  Test Loss: 0.0008109.. 
Epoch: 6/10..  Training Loss: 0.0004802..  Test Loss: 0.0001630.. 
Epoch: 7/10..  Training Loss: 0.0001397..  Test Loss: 0.0000947.. 
Epoch: 8/10..  Training Loss: 0.0000927..  Test Loss: 0.0000680.. 
Epoch: 9/10..  Training Loss: 0.0000720..  Test Loss: 0.0000503.. 
output prediction
tensor([[1.4031, 1.6043],
        [1.3971, 1.5961],
        [1.3949, 1.5948],
        [1.4034, 1.6042],
        [1.3858, 1.5825],
        [1.3948, 1.5934],
        [1.3978, 1.5992],
        [1.4024, 1.6062],
        [1.4026, 1.6032]], grad_fn=<ReluBackward0>)
Epoch: 10/10..  Training Loss: 0.0000559..  Test Loss: 0.0000394..

Output for Run 2:

Epoch: 1/10..  Training Loss: 1.3082893..  Test Loss: 0.9945563.. 
Epoch: 2/10..  Training Loss: 0.9902789..  Test Loss: 0.9884600.. 
Epoch: 3/10..  Training Loss: 0.9855704..  Test Loss: 0.9838181.. 
Epoch: 4/10..  Training Loss: 0.9823556..  Test Loss: 0.9815999.. 
Epoch: 5/10..  Training Loss: 0.9810459..  Test Loss: 0.9806178.. 
Epoch: 6/10..  Training Loss: 0.9803946..  Test Loss: 0.9801530.. 
Epoch: 7/10..  Training Loss: 0.9801204..  Test Loss: 0.9800670.. 
Epoch: 8/10..  Training Loss: 0.9800622..  Test Loss: 0.9800397.. 
Epoch: 9/10..  Training Loss: 0.9800435..  Test Loss: 0.9800270.. 
output prediction
tensor([[0.0000, 1.5930],
        [0.0000, 1.5946],
        [0.0000, 1.5916],
        [0.0000, 1.5956],
        [0.0000, 1.5970],
        [0.0000, 1.5951],
        [0.0000, 1.5919],
        [0.0000, 1.5887],
        [0.0000, 1.5913]], grad_fn=<ReluBackward0>)
Epoch: 10/10..  Training Loss: 0.9800317..  Test Loss: 0.9800202.. 

For Run 3:

Epoch: 1/10..  Training Loss: 2.2600000..  Test Loss: 2.2600000.. 
Epoch: 2/10..  Training Loss: 2.2600000..  Test Loss: 2.2600000.. 
Epoch: 3/10..  Training Loss: 2.2600000..  Test Loss: 2.2600000.. 
Epoch: 4/10..  Training Loss: 2.2600000..  Test Loss: 2.2600000.. 
Epoch: 5/10..  Training Loss: 2.2600000..  Test Loss: 2.2600000.. 
Epoch: 6/10..  Training Loss: 2.2600000..  Test Loss: 2.2600000.. 
Epoch: 7/10..  Training Loss: 2.2600000..  Test Loss: 2.2600000.. 
Epoch: 8/10..  Training Loss: 2.2600000..  Test Loss: 2.2600000.. 
Epoch: 9/10..  Training Loss: 2.2600000..  Test Loss: 2.2600000.. 
output prediction
tensor([[0., 0.],
        [0., 0.],
        [0., 0.],
        [0., 0.],
        [0., 0.],
        [0., 0.],
        [0., 0.],
        [0., 0.],
        [0., 0.]], grad_fn=<ReluBackward0>)
Epoch: 10/10..  Training Loss: 2.2600000..  Test Loss: 2.2600000..

For Run 4:

Epoch: 1/10..  Training Loss: 1.3272166..  Test Loss: 0.9984156.. 
Epoch: 2/10..  Training Loss: 0.9895578..  Test Loss: 0.9840302.. 
Epoch: 3/10..  Training Loss: 0.9825955..  Test Loss: 0.9813274.. 
Epoch: 4/10..  Training Loss: 0.9807327..  Test Loss: 0.9804195.. 
Epoch: 5/10..  Training Loss: 0.9801892..  Test Loss: 0.9802050.. 
Epoch: 6/10..  Training Loss: 0.9801012..  Test Loss: 0.9801544.. 
Epoch: 7/10..  Training Loss: 0.9800708..  Test Loss: 0.9801207.. 
Epoch: 8/10..  Training Loss: 0.9800515..  Test Loss: 0.9800962.. 
Epoch: 9/10..  Training Loss: 0.9800386..  Test Loss: 0.9800771.. 
output prediction
tensor([[0.0000, 1.5929],
        [0.0000, 1.5888],
        [0.0000, 1.6203],
        [0.0000, 1.6003],
        [0.0000, 1.6016],
        [0.0000, 1.5979],
        [0.0000, 1.6009],
        [0.0000, 1.5887],
        [0.0000, 1.5899]], grad_fn=<ReluBackward0>)
Epoch: 10/10..  Training Loss: 0.9800294..  Test Loss: 0.9800624..

I assume it has something to do with the torch random seed because when I add

torch.manual_seed(0)

I always get 0 for the first of the TWO output values, i.e. the output always resembles that of Run 4 above (see also the seeding note after this log):

Epoch: 1/10..  Training Loss: 1.3272166..  Test Loss: 0.9984156.. 
Epoch: 2/10..  Training Loss: 0.9895578..  Test Loss: 0.9840302.. 
Epoch: 3/10..  Training Loss: 0.9825955..  Test Loss: 0.9813274.. 
Epoch: 4/10..  Training Loss: 0.9807327..  Test Loss: 0.9804195.. 
Epoch: 5/10..  Training Loss: 0.9801892..  Test Loss: 0.9802050.. 
Epoch: 6/10..  Training Loss: 0.9801012..  Test Loss: 0.9801544.. 
Epoch: 7/10..  Training Loss: 0.9800708..  Test Loss: 0.9801207.. 
Epoch: 8/10..  Training Loss: 0.9800515..  Test Loss: 0.9800962.. 
Epoch: 9/10..  Training Loss: 0.9800386..  Test Loss: 0.9800771.. 
output prediction
tensor([[0.0000, 1.5929],
        [0.0000, 1.5888],
        [0.0000, 1.6203],
        [0.0000, 1.6003],
        [0.0000, 1.6016],
        [0.0000, 1.5979],
        [0.0000, 1.6009],
        [0.0000, 1.5887],
        [0.0000, 1.5899]], grad_fn=<ReluBackward0>)
Epoch: 10/10..  Training Loss: 0.9800294..  Test Loss: 0.9800624.. 
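One thing worth noting: torch.manual_seed(0) only seeds PyTorch (e.g. the random weight initialization); the shuffle and train_test_split calls in my script use NumPy/scikit-learn's global random state, so the train/validation split can still change between runs unless that is seeded too. A minimal sketch of pinning both (the seed value 0 is arbitrary):

import numpy as np
import torch
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

np.random.seed(0)     # seeds NumPy, which sklearn falls back to when random_state is None
torch.manual_seed(0)  # seeds PyTorch, i.e. the model's initial weights

# data / targets as loaded in the script below
data = shuffle(data, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    data, targets, test_size=0.2, random_state=0)  # fixed split across runs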

I would like to attach my code and my hdf file here for reproducibility.
My code:

from pathlib import Path
import numpy as np
#np.random.seed(0)
import pandas as pd
import torch
#torch.manual_seed(0)
import matplotlib.pyplot as plt
from torch import nn, optim
from torch.utils.data import DataLoader, Dataset
import torch.nn.functional as F
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import sys
from sklearn.utils import shuffle


class Regressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 144)
        self.fc2 = nn.Linear(144, 72)
        self.fc3 = nn.Linear(72, 18)
        self.fc4 = nn.Linear(18, 2)


    def forward(self, x):


        #print("fc1", x.shape)
        x = F.relu(self.fc1(x))
        #print("fc2", x.shape)
        x = F.relu(self.fc2(x))
        #print("fc3", x.shape)
        x = F.relu(self.fc3(x))
        #print("fc4", x.shape)
        x = F.relu(self.fc4(x))
        #print("last", x.shape)

        return x




p = Path.cwd()

fpath = p/"Nov20_2019_romean_entries/SigxFactor1.4_SigyFactor1.6_Nov20_2019.h5"

data = pd.read_hdf(str(fpath), key="df")

data = shuffle(data)

print(data.columns)
print(data.isnull().values.any())
targets = data[["x_val", "y_val"]]
print(targets)
data = data.drop(["x_val","y_val"], axis=1)



columns = data.columns
print("data b4 minmax")
print(data.head())

print("columns shape ", len(columns))
print("data shape ",data.shape)

scaler = MinMaxScaler()
data = pd.DataFrame(scaler.fit_transform(data), columns = columns)
#data['SalePrice'] = sale_price
print(data.head())

#sys.exit()

X_train, X_val, y_train, y_val = train_test_split(data, targets, test_size=0.2)

#print("feature shape ", X_train.shape)
#print(X_val.shape)
#
#print("target shape ", y_train.shape)
#print(y_val.shape)


train_batch = np.array_split(X_train, 50)
label_batch = np.array_split(y_train, 50)

print("train batch len ", len(train_batch))
print("label batch len ", len(label_batch))

#print(train_batch[49])
#print(train_batch[49].to_numpy().shape)

print("label batch")
print(label_batch[49].to_numpy().shape)
print(label_batch[49])

for i in range(len(train_batch)):
    train_batch[i] = torch.from_numpy(train_batch[i].to_numpy()).float()
for i in range(len(label_batch)):
    label_batch[i] = torch.from_numpy(label_batch[i].to_numpy()).float()
    #label_batch[i] = torch.from_numpy(label_batch[i].to_numpy()).float().view(-1, 2)

print("label_batch ", label_batch[49])
print("label_batch shape ", label_batch[49].shape)


X_val = torch.from_numpy(X_val.to_numpy()).float()
y_val = torch.from_numpy(y_val.to_numpy()).float()
#y_val = torch.from_numpy(y_val.to_numpy()).float().view(-1, 2)


#print(len(train_batch))
#sys.exit()

#device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
#device = torch.device("cpu")
model = Regressor()
#model.to(dtype= torch.float64, device = device)


#ps = model(train_batch[0])
#print(ps.shape)
#print(ps)
#sys.exit()
#model = Regressor()
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

epochs = 10

#device =

train_losses, test_losses = [], []
for e in range(epochs):
    model.train()
    train_loss = 0
    for i in range(len(train_batch)):
        optimizer.zero_grad()
        #model.to(device)
        output = model(train_batch[i])
        #output = model(train_batch[i].to(dtype= torch.float64, device= device))


        loss = criterion(output, label_batch[i])
        #loss = criterion(output, label_batch[i].to(dtype=torch.float64, device = device))
        loss.backward()
        optimizer.step()

        train_loss += loss.item()

        if e==9 and i==49:
            print("output prediction")
            print(output)

    else:  # for/else: this block runs once per epoch, after the batch loop finishes
        test_loss = 0
        accuracy = 0

        with torch.no_grad():
            model.eval()
            predictions = model(X_val)
            #predictions = model(X_val.to(dtype= torch.float64, device= device))
            #if i==49:
            #    print("inside")
            #    print(predictions)
            #    print(predictions.shape)
            #test_loss += torch.sqrt(criterion(torch.log(predictions), torch.log(y_val)))

            test_loss += criterion(predictions, y_val)

        train_losses.append(train_loss/len(train_batch))
        test_losses.append(test_loss)

        print("Epoch: {}/{}.. ".format(e+1, epochs),
              "Training Loss: {:.7f}.. ".format(train_loss/len(train_batch)),
              "Test Loss: {:.7f}.. ".format(test_loss))

I am wondering what the exact reason behind this anomaly might be.

Thank you.

my hdf file

I’m not sure if you are asking about bitwise reproducibility or about why your model converges to a single class.
For the first point, have a look at the Reproducibility docs.
The second issue might occur if your training is “unstable”. Did you play around with some hyperparameters or, e.g., remove the last relu?
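In this model that would mean leaving the output layer linear, e.g. a minimal sketch of the forward pass without the final relu (everything else unchanged):

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        # no relu on the regression head, so both outputs can take any real value
        return self.fc4(x)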

Thank you for the reply.
I find it weird that the model's convergence depends on the execution: sometimes the loss stays high and other times it goes low.
I have kept all hyperparameters fixed and have always used the same model across executions.

class Regressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 144)
        self.fc2 = nn.Linear(144, 72)
        self.fc3 = nn.Linear(72, 18)
        self.fc4 = nn.Linear(18, 2)


    def forward(self, x):


        #print("fc1", x.shape)
        x = F.relu(self.fc1(x))
        #print("fc2", x.shape)
        x = F.relu(self.fc2(x))
        #print("fc3", x.shape)
        x = F.relu(self.fc3(x))
        #print("fc4", x.shape)
        x = F.relu(self.fc4(x))
        #print("last", x.shape)

        return x

Do you think it might have something to do with autograd or random weight initialization?
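For the weight-initialization part, a quick check (a minimal sketch, not from my actual script) suggests the initial weights do differ from run to run unless a seed is set right before building the model:

import torch
from torch import nn

# two unseeded layers start from different random weights
a, b = nn.Linear(4, 144), nn.Linear(4, 144)
print(torch.equal(a.weight, b.weight))   # False

# seeding immediately before construction makes them identical
torch.manual_seed(0); c = nn.Linear(4, 144)
torch.manual_seed(0); d = nn.Linear(4, 144)
print(torch.equal(c.weight, d.weight))   # True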

Thank you.
I just noticed the relu at the last layer was the problem.