Different loss values at different executions

I have been trying to carry out regression on the data in my hdf file (attached below) to predict TWO outputs, but I have been getting different loss values on different executions. In most executions the model predicts 0 for one or even both of the outputs.

The true outputs (targets) are the following:

label_batch  tensor([[1.4000, 1.6000],
        [1.4000, 1.6000],
        [1.4000, 1.6000],
        [1.4000, 1.6000],
        [1.4000, 1.6000],
        [1.4000, 1.6000],
        [1.4000, 1.6000],
        [1.4000, 1.6000],
        [1.4000, 1.6000]])

Output for, say, Run 1, which gives the values closest to the targets:

Epoch: 1/10..  Training Loss: 0.9370870..  Test Loss: 0.0499674.. 
Epoch: 2/10..  Training Loss: 0.0297776..  Test Loss: 0.0184589.. 
Epoch: 3/10..  Training Loss: 0.0117876..  Test Loss: 0.0087056.. 
Epoch: 4/10..  Training Loss: 0.0054326..  Test Loss: 0.0034268.. 
Epoch: 5/10..  Training Loss: 0.0020547..  Test Loss: 0.0008109.. 
Epoch: 6/10..  Training Loss: 0.0004802..  Test Loss: 0.0001630.. 
Epoch: 7/10..  Training Loss: 0.0001397..  Test Loss: 0.0000947.. 
Epoch: 8/10..  Training Loss: 0.0000927..  Test Loss: 0.0000680.. 
Epoch: 9/10..  Training Loss: 0.0000720..  Test Loss: 0.0000503.. 
output prediction
tensor([[1.4031, 1.6043],
        [1.3971, 1.5961],
        [1.3949, 1.5948],
        [1.4034, 1.6042],
        [1.3858, 1.5825],
        [1.3948, 1.5934],
        [1.3978, 1.5992],
        [1.4024, 1.6062],
        [1.4026, 1.6032]], grad_fn=<ReluBackward0>)
Epoch: 10/10..  Training Loss: 0.0000559..  Test Loss: 0.0000394..

Output for Run 2:

Epoch: 1/10..  Training Loss: 1.3082893..  Test Loss: 0.9945563.. 
Epoch: 2/10..  Training Loss: 0.9902789..  Test Loss: 0.9884600.. 
Epoch: 3/10..  Training Loss: 0.9855704..  Test Loss: 0.9838181.. 
Epoch: 4/10..  Training Loss: 0.9823556..  Test Loss: 0.9815999.. 
Epoch: 5/10..  Training Loss: 0.9810459..  Test Loss: 0.9806178.. 
Epoch: 6/10..  Training Loss: 0.9803946..  Test Loss: 0.9801530.. 
Epoch: 7/10..  Training Loss: 0.9801204..  Test Loss: 0.9800670.. 
Epoch: 8/10..  Training Loss: 0.9800622..  Test Loss: 0.9800397.. 
Epoch: 9/10..  Training Loss: 0.9800435..  Test Loss: 0.9800270.. 
output prediction
tensor([[0.0000, 1.5930],
        [0.0000, 1.5946],
        [0.0000, 1.5916],
        [0.0000, 1.5956],
        [0.0000, 1.5970],
        [0.0000, 1.5951],
        [0.0000, 1.5919],
        [0.0000, 1.5887],
        [0.0000, 1.5913]], grad_fn=<ReluBackward0>)
Epoch: 10/10..  Training Loss: 0.9800317..  Test Loss: 0.9800202.. 

For Run 3:

Epoch: 1/10..  Training Loss: 2.2600000..  Test Loss: 2.2600000.. 
Epoch: 2/10..  Training Loss: 2.2600000..  Test Loss: 2.2600000.. 
Epoch: 3/10..  Training Loss: 2.2600000..  Test Loss: 2.2600000.. 
Epoch: 4/10..  Training Loss: 2.2600000..  Test Loss: 2.2600000.. 
Epoch: 5/10..  Training Loss: 2.2600000..  Test Loss: 2.2600000.. 
Epoch: 6/10..  Training Loss: 2.2600000..  Test Loss: 2.2600000.. 
Epoch: 7/10..  Training Loss: 2.2600000..  Test Loss: 2.2600000.. 
Epoch: 8/10..  Training Loss: 2.2600000..  Test Loss: 2.2600000.. 
Epoch: 9/10..  Training Loss: 2.2600000..  Test Loss: 2.2600000.. 
output prediction
tensor([[0., 0.],
        [0., 0.],
        [0., 0.],
        [0., 0.],
        [0., 0.],
        [0., 0.],
        [0., 0.],
        [0., 0.],
        [0., 0.]], grad_fn=<ReluBackward0>)
Epoch: 10/10..  Training Loss: 2.2600000..  Test Loss: 2.2600000..

For Run 4:

Epoch: 1/10..  Training Loss: 1.3272166..  Test Loss: 0.9984156.. 
Epoch: 2/10..  Training Loss: 0.9895578..  Test Loss: 0.9840302.. 
Epoch: 3/10..  Training Loss: 0.9825955..  Test Loss: 0.9813274.. 
Epoch: 4/10..  Training Loss: 0.9807327..  Test Loss: 0.9804195.. 
Epoch: 5/10..  Training Loss: 0.9801892..  Test Loss: 0.9802050.. 
Epoch: 6/10..  Training Loss: 0.9801012..  Test Loss: 0.9801544.. 
Epoch: 7/10..  Training Loss: 0.9800708..  Test Loss: 0.9801207.. 
Epoch: 8/10..  Training Loss: 0.9800515..  Test Loss: 0.9800962.. 
Epoch: 9/10..  Training Loss: 0.9800386..  Test Loss: 0.9800771.. 
output prediction
tensor([[0.0000, 1.5929],
        [0.0000, 1.5888],
        [0.0000, 1.6203],
        [0.0000, 1.6003],
        [0.0000, 1.6016],
        [0.0000, 1.5979],
        [0.0000, 1.6009],
        [0.0000, 1.5887],
        [0.0000, 1.5899]], grad_fn=<ReluBackward0>)
Epoch: 10/10..  Training Loss: 0.9800294..  Test Loss: 0.9800624..

I assume it has something to do with the torch random seed because when I add

torch.manual_seed(0)

I always get 0 for the first of the TWO output values, i.e. the output always resembles that of Run 4 above (see also the seeding note after this log):

Epoch: 1/10..  Training Loss: 1.3272166..  Test Loss: 0.9984156.. 
Epoch: 2/10..  Training Loss: 0.9895578..  Test Loss: 0.9840302.. 
Epoch: 3/10..  Training Loss: 0.9825955..  Test Loss: 0.9813274.. 
Epoch: 4/10..  Training Loss: 0.9807327..  Test Loss: 0.9804195.. 
Epoch: 5/10..  Training Loss: 0.9801892..  Test Loss: 0.9802050.. 
Epoch: 6/10..  Training Loss: 0.9801012..  Test Loss: 0.9801544.. 
Epoch: 7/10..  Training Loss: 0.9800708..  Test Loss: 0.9801207.. 
Epoch: 8/10..  Training Loss: 0.9800515..  Test Loss: 0.9800962.. 
Epoch: 9/10..  Training Loss: 0.9800386..  Test Loss: 0.9800771.. 
output prediction
tensor([[0.0000, 1.5929],
        [0.0000, 1.5888],
        [0.0000, 1.6203],
        [0.0000, 1.6003],
        [0.0000, 1.6016],
        [0.0000, 1.5979],
        [0.0000, 1.6009],
        [0.0000, 1.5887],
        [0.0000, 1.5899]], grad_fn=<ReluBackward0>)
Epoch: 10/10..  Training Loss: 0.9800294..  Test Loss: 0.9800624.. 
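One thing worth noting: torch.manual_seed(0) only seeds PyTorch (e.g. the random weight initialization); the shuffle and train_test_split calls in my script use NumPy/scikit-learn's global random state, so the train/validation split can still change between runs unless that is seeded too. A minimal sketch of pinning both (the seed value 0 is arbitrary):

import numpy as np
import torch
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

np.random.seed(0)     # seeds NumPy, which sklearn falls back to when random_state is None
torch.manual_seed(0)  # seeds PyTorch, i.e. the model's initial weights

# data / targets as loaded in the script below
data = shuffle(data, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    data, targets, test_size=0.2, random_state=0)  # fixed split across runs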

I would like to attach my code and my hdf file here for reproducibility.
My code:

from pathlib import Path
import numpy as np
#np.random.seed(0)
import pandas as pd
import torch
#torch.manual_seed(0)
import matplotlib.pyplot as plt
from torch import nn, optim
from torch.utils.data import DataLoader, Dataset
import torch.nn.functional as F
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import sys
from sklearn.utils import shuffle


class Regressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 144)
        self.fc2 = nn.Linear(144, 72)
        self.fc3 = nn.Linear(72, 18)
        self.fc4 = nn.Linear(18, 2)


    def forward(self, x):


        #print("fc1", x.shape)
        x = F.relu(self.fc1(x))
        #print("fc2", x.shape)
        x = F.relu(self.fc2(x))
        #print("fc3", x.shape)
        x = F.relu(self.fc3(x))
        #print("fc4", x.shape)
        x = F.relu(self.fc4(x))
        #print("last", x.shape)

        return x




p = Path.cwd()

fpath = p/"Nov20_2019_romean_entries/SigxFactor1.4_SigyFactor1.6_Nov20_2019.h5"

data = pd.read_hdf(str(fpath), key="df")

data = shuffle(data)

print(data.columns)
print(data.isnull().values.any())
targets = data[["x_val", "y_val"]]
print(targets)
data = data.drop(["x_val","y_val"], axis=1)



columns = data.columns
print("data b4 minmax")
print(data.head())

print("columns shape ", len(columns))
print("data shape ",data.shape)

scaler = MinMaxScaler()
data = pd.DataFrame(scaler.fit_transform(data), columns = columns)
#data['SalePrice'] = sale_price
print(data.head())

#sys.exit()

X_train, X_val, y_train, y_val = train_test_split(data, targets, test_size=0.2)

#print("feature shape ", X_train.shape)
#print(X_val.shape)
#
#print("target shape ", y_train.shape)
#print(y_val.shape)


train_batch = np.array_split(X_train, 50)
label_batch = np.array_split(y_train, 50)

print("train batch len ", len(train_batch))
print("label batch len ", len(label_batch))

#print(train_batch[49])
#print(train_batch[49].to_numpy().shape)

print("label batch")
print(label_batch[49].to_numpy().shape)
print(label_batch[49])

for i in range(len(train_batch)):
    train_batch[i] = torch.from_numpy(train_batch[i].to_numpy()).float()
for i in range(len(label_batch)):
    label_batch[i] = torch.from_numpy(label_batch[i].to_numpy()).float()
    #label_batch[i] = torch.from_numpy(label_batch[i].to_numpy()).float().view(-1, 2)

print("label_batch ", label_batch[49])
print("label_batch shape ", label_batch[49].shape)


X_val = torch.from_numpy(X_val.to_numpy()).float()
y_val = torch.from_numpy(y_val.to_numpy()).float()
#y_val = torch.from_numpy(y_val.to_numpy()).float().view(-1, 2)


#print(len(train_batch))
#sys.exit()

#device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
#device = torch.device("cpu")
model = Regressor()
#model.to(dtype= torch.float64, device = device)


#ps = model(train_batch[0])
#print(ps.shape)
#print(ps)
#sys.exit()
#model = Regressor()
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

epochs = 10

#device =

train_losses, test_losses = [], []
for e in range(epochs):
    model.train()
    train_loss = 0
    for i in range(len(train_batch)):
        optimizer.zero_grad()
        #model.to(device)
        output = model(train_batch[i])
        #output = model(train_batch[i].to(dtype= torch.float64, device= device))


        loss = criterion(output, label_batch[i])
        #loss = criterion(output, label_batch[i].to(dtype=torch.float64, device = device))
        loss.backward()
        optimizer.step()

        train_loss += loss.item()

        if e==9 and i==49:
            print("output prediction")
            print(output)

    else:  # for/else: this block runs once per epoch, after the batch loop finishes
        test_loss = 0
        accuracy = 0

        with torch.no_grad():
            model.eval()
            predictions = model(X_val)
            #predictions = model(X_val.to(dtype= torch.float64, device= device))
            #if i==49:
            #    print("inside")
            #    print(predictions)
            #    print(predictions.shape)
            #test_loss += torch.sqrt(criterion(torch.log(predictions), torch.log(y_val)))

            test_loss += criterion(predictions, y_val)

        train_losses.append(train_loss/len(train_batch))
        test_losses.append(test_loss)

        print("Epoch: {}/{}.. ".format(e+1, epochs),
              "Training Loss: {:.7f}.. ".format(train_loss/len(train_batch)),
              "Test Loss: {:.7f}.. ".format(test_loss))

I am wondering what the exact reason behind this anomaly might be.

Thank you.

my hdf file

I’m not sure if you are asking about bitwise reproducibility or about why your model converges to a single class.
For the first point, have a look at the Reproducibility docs.
The second issue might occur if your training is “unstable”. Did you play around with some hyperparameters or, e.g., remove the last relu?
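In this model that would mean leaving the output layer linear, e.g. a minimal sketch of the forward pass without the final relu (everything else unchanged):

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        # no relu on the regression head, so both outputs can take any real value
        return self.fc4(x)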

Thank you for the reply.
I find it weird that the model's convergence depends on the execution: sometimes the loss stays high and other times it goes low.
I have kept all hyperparameters fixed and have always used the same model across executions.

class Regressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 144)
        self.fc2 = nn.Linear(144, 72)
        self.fc3 = nn.Linear(72, 18)
        self.fc4 = nn.Linear(18, 2)


    def forward(self, x):


        #print("fc1", x.shape)
        x = F.relu(self.fc1(x))
        #print("fc2", x.shape)
        x = F.relu(self.fc2(x))
        #print("fc3", x.shape)
        x = F.relu(self.fc3(x))
        #print("fc4", x.shape)
        x = F.relu(self.fc4(x))
        #print("last", x.shape)

        return x

Do you think it might have something to do with autograd or random weight initialization?
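For the weight-initialization part, a quick check (a minimal sketch, not from my actual script) suggests the initial weights do differ from run to run unless a seed is set right before building the model:

import torch
from torch import nn

# two unseeded layers start from different random weights
a, b = nn.Linear(4, 144), nn.Linear(4, 144)
print(torch.equal(a.weight, b.weight))   # False

# seeding immediately before construction makes them identical
torch.manual_seed(0); c = nn.Linear(4, 144)
torch.manual_seed(0); d = nn.Linear(4, 144)
print(torch.equal(c.weight, d.weight))   # True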

Thank you.
I just noticed the relu at the last layer was the problem.