Weights not updating and all parameters set to None with PyTorch neural network

import pandas as pd
from tqdm import tqdm
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import torch.optim as optim
import matplotlib.pyplot as plt
import time
import sys
import os


os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"

#Load Training.csv
trainingData = pd.read_csv('training.csv')


#The training data labels each event as either 's' or 'b' for signal or background;
#we use this mapping to turn 's' into 1 and 'b' into 0
mappings = {
    's': 1, 
    'b': 0
}
trainingData['Label'] = trainingData['Label'].apply(lambda x: mappings[x])

#Replace placeholder values (anything <= -998.0) with 0
trainingData = trainingData.where(trainingData > -998.0, 0)

sig = nn.Sigmoid()


model = nn.Sequential(  nn.Linear(31, 60), 
                        nn.ReLU(), 
                        nn.Linear(60, 60),
                        nn.Linear(60, 60),
                        nn.Linear(60, 60),
                        nn.ReLU(),
                        nn.Linear(60, 60),
                        nn.Linear(60, 60), 
                        nn.ReLU(),
                        nn.Linear(60, 1))



criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)


#initializing some variables for plotting loss over epoch
yList = []
xList = []
yListlater = []
xListlater = []
laterInt = 1


inputData = torch.tensor(trainingData.drop(['Weight', 'Label'], axis=1).values, dtype=float)
outputData = torch.tensor(trainingData['Label'].values, dtype=float).reshape(250000, 1)
listIn = torch.chunk(inputData, 25000, dim=0)
listOut = torch.chunk(outputData, 25000, dim=0)

epochs = 100

for name, param in model.named_parameters():
        print(name, param.grad)

print(model)

print(model[0].weight)

for i in tqdm(range(epochs)):
    x = listIn[i]
    y = listOut[i]

    outData = model(x.float())
    
    optimizer.zero_grad()

    loss = criterion(outData, target=y)

    loss.backward()

    optimizer.step()

    if (i % 1 == 0):
        yList.append(loss.item())
        xList.append(i)

print(model[0].weight)

plt.plot(xList, yList)
plt.xlabel('epoch')
plt.ylabel('running_loss')
plt.title('loss vs epoch')
plt.show()

Here is the code. I have been trying to figure out why nothing is changing for a couple of days now, with no luck.
Here is the output as well.

0.weight None
0.bias None
2.weight None
2.bias None
3.weight None
3.bias None
4.weight None
4.bias None
6.weight None
6.bias None
7.weight None
7.bias None
9.weight None
9.bias None
Sequential(
  (0): Linear(in_features=31, out_features=60, bias=True)
  (1): ReLU()
  (2): Linear(in_features=60, out_features=60, bias=True)
  (3): Linear(in_features=60, out_features=60, bias=True)
  (4): Linear(in_features=60, out_features=60, bias=True)
  (5): ReLU()
  (6): Linear(in_features=60, out_features=60, bias=True)
  (7): Linear(in_features=60, out_features=60, bias=True)
  (8): ReLU()
  (9): Linear(in_features=60, out_features=1, bias=True)
)
Parameter containing:
tensor([[ 0.1255, -0.0629, -0.0803, …, 0.0183, -0.1115, 0.0929],
[ 0.0016, -0.1008, 0.1053, …, 0.1485, -0.0020, -0.0183],
[ 0.0559, -0.0457, 0.0267, …, 0.0708, 0.0448, -0.0680],
…,
[ 0.0056, 0.1003, 0.0257, …, -0.1627, 0.0517, -0.0623],
[-0.0674, 0.1298, -0.0035, …, 0.1628, -0.0131, -0.1417],
[-0.0073, 0.1473, -0.0778, …, 0.1646, 0.0768, 0.1202]],
requires_grad=True)
[In here is the training loop]
Parameter containing:
tensor([[ 0.1255, -0.0629, -0.0803, …, 0.0183, -0.1115, 0.0929],
[ 0.0016, -0.1008, 0.1053, …, 0.1485, -0.0020, -0.0183],
[ 0.0559, -0.0457, 0.0267, …, 0.0708, 0.0448, -0.0680],
…,
[ 0.0056, 0.1003, 0.0257, …, -0.1627, 0.0517, -0.0623],
[-0.0674, 0.1298, -0.0035, …, 0.1628, -0.0131, -0.1417],
[-0.0073, 0.1473, -0.0778, …, 0.1646, 0.0768, 0.1202]],
requires_grad=True)

There are a few issues in your script:

  • “all parameters set to None” is not the case, as you are printing the .grad attributes before the first backward pass and thus it’s expected to see None gradients
  • the last linear layer outputs a single logit, which doesn't make sense in combination with nn.CrossEntropyLoss, since this loss function is meant for multi-class classification. Because your model only outputs a logit for a single class, that class is always predicted with probability 1 (the softmax over one logit is always 1), so the loss is always zero and no gradients flow; see the small check below
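
Here is a minimal standalone check of that point, separate from your training script: with a single output logit, the loss is identically zero and no gradients flow.

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
logits = torch.randn(4, 1, requires_grad=True)  # one logit per sample, i.e. a single class
targets = torch.zeros(4, dtype=torch.long)      # the only valid class index is 0
loss = criterion(logits, targets)
loss.backward()
print(loss)         # loss is exactly 0
print(logits.grad)  # gradients are all zeros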

If I fix this issue and use e.g. 2 classes, you can see that the model parameters change:

model = nn.Sequential(  nn.Linear(31, 60), 
                        nn.ReLU(), 
                        nn.Linear(60, 60),
                        nn.Linear(60, 60),
                        nn.Linear(60, 60),
                        nn.ReLU(),
                        nn.Linear(60, 60),
                        nn.Linear(60, 60), 
                        nn.ReLU(),
                        nn.Linear(60, 2))

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

print(model[0].weight)

epochs = 1000
for i in range(epochs):
    x = torch.randn(1, 31)
    y = torch.randint(0, 2, (1,))
    output = model(x)
    optimizer.zero_grad()
    loss = criterion(output, y)
    loss.backward()
    optimizer.step()

print(model[0].weight)

So if I wanted my target to be just a single number between 0 and 1, CrossEntropyLoss would be the wrong criterion? Which would be a better one? Or should I just have the last layer of nn.Linear be (60, 2) and have the two outputs, and could I then make those two outputs be the confidence in the input leading to a background event (0) or a signal event (1)? At least that was my vision: just train a neural network to give me either a 0 or 1. Also, sorry, I don't really know what's possible with neural networks; I am very new to this.

It depends on what your actual use case is. If you are working on a binary classification, you could use a single output with nn.BCEWithLogitsLoss as the loss function instead.
Also, note that multiple stacked linear layers can be replaced with a single one if there is no activation function between them, so you might also want to adapt the model architecture.
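
As a minimal sketch of both points, assuming a single-logit model with an activation after every hidden layer, trained on random data with nn.BCEWithLogitsLoss (the layer sizes are just placeholders):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(31, 60),
                      nn.ReLU(),
                      nn.Linear(60, 60),
                      nn.ReLU(),
                      nn.Linear(60, 1))  # a single logit for binary classification

criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

x = torch.randn(10, 31)                   # batch of 10 samples with 31 features
y = torch.randint(0, 2, (10, 1)).float()  # targets are 0. or 1., same shape as the model output

output = model(x)            # raw logits, shape [10, 1]
optimizer.zero_grad()
loss = criterion(output, y)  # the sigmoid is applied internally by BCEWithLogitsLoss
loss.backward()
optimizer.step()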

Okay, I will look into that new loss function, but after making the changes you mentioned to the code, at least from what I can see all the weight and bias tensors are at zero and NaN now. I may have also just changed it wrong, but here are the sections of the code that I changed:

model = nn.Sequential(  nn.Linear(31, 60), 
                        nn.ReLU(), 
                        nn.Linear(60, 60),
                        nn.Linear(60, 60),
                        nn.Linear(60, 60),
                        nn.ReLU(),
                        nn.Linear(60, 60),
                        nn.Linear(60, 60), 
                        nn.ReLU(),
                        nn.Linear(60, 2))

I also changed outputData to dtype=torch.long to fix an issue, and got rid of the reshape call on outputData so that it is a 1D tensor. Also, thank you so much for the help; it's greatly appreciated!

Parameter containing:
tensor([[ 0.0539,  0.0271,  0.0439,  ...,  0.1696, -0.0786, -0.1070],
        [-0.0865,  0.0066, -0.1254,  ..., -0.1123, -0.0493, -0.0810],
        [ 0.0880, -0.0347,  0.1111,  ..., -0.1072,  0.1138, -0.1486],
        ...,
        [ 0.0956,  0.1438,  0.1690,  ..., -0.0213, -0.1159,  0.0290],
        [-0.0837,  0.0139, -0.0046,  ..., -0.1025,  0.0138, -0.0071],
        [ 0.1230, -0.0192,  0.0522,  ..., -0.1669, -0.1791,  0.0807]],
       requires_grad=True)

Parameter containing:
tensor([[-5.9281e+06, -3.3843e+03, -3.5574e+03,  ..., -1.2967e+01,
          1.5325e+01, -2.1859e+03],
        [-8.6483e-02,  6.5728e-03, -1.2543e-01,  ..., -1.1226e-01,
         -4.9323e-02, -8.1002e-02],
        [-1.0422e+07, -5.9497e+03, -6.2538e+03,  ..., -2.3201e+01,
          2.7193e+01, -3.8428e+03],
        ...,
        [-1.5777e+01,  1.2120e-01,  1.6190e-01,  ..., -2.1391e-02,
         -1.1571e-01,  9.5351e-03],
        [-8.3709e-02,  1.3877e-02, -4.5941e-03,  ..., -1.0254e-01,
          1.3761e-02, -7.1247e-03],
        [-1.8538e+07, -1.0584e+04, -1.1125e+04,  ..., -4.1247e+01,
          4.7990e+01, -6.8354e+03]], requires_grad=True)
0.weight tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])
0.bias tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
2.weight tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]])
2.bias tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
        nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
        nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])
3.weight tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]])
3.bias tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
        nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
        nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])
4.weight tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]])
4.bias tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
        nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
        nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])
6.weight tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]])
6.bias tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
        nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
        nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])
7.weight tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]])
7.bias tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
        nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
        nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])
9.weight tensor([[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
         nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
         nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
         nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
         nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]])
9.bias tensor([nan, nan])

If your parameters are invalid after a few iterations, your loss (and thus the gradients) might blow up, which would overflow the parameters.
You could add the new loss function to my code snippet and experiment with it.
If you initialize the random data once using my code, you will be able to overfit the dataset properly.
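
A minimal sketch of that experiment, assuming nn.BCEWithLogitsLoss and one fixed random batch created once outside the loop (if everything is wired up correctly, the loss should drop toward zero):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(31, 60),
                      nn.ReLU(),
                      nn.Linear(60, 60),
                      nn.ReLU(),
                      nn.Linear(60, 1))

criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Create the random data once so the model can overfit this fixed batch.
x = torch.randn(64, 31)
y = torch.randint(0, 2, (64, 1)).float()

for i in range(1000):
    optimizer.zero_grad()
    output = model(x)
    loss = criterion(output, y)
    loss.backward()
    optimizer.step()

print(loss.item())  # should be close to 0 once the batch is overfit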

So I am not sure how to make it so the loss doesn't blow up, as you say. Something I have seen in some of the neural network examples online is nn.LogSoftmax; is this related to what you are suggesting I do? Also, I have implemented the new loss function, but I'm not sure if I am using it correctly.

pos_weight = torch.ones([1])

criterionBCE = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

inputData = torch.tensor(trainingData.drop(['Weight', 'Label'], axis=1).values, dtype=float)
outputData = torch.tensor(trainingData['Label'].values, dtype=torch.long)
listIn = torch.chunk(inputData, 25000, dim=0)
listOut = torch.chunk(outputData, 25000, dim=0)

epochs = 1000

print(model[0].weight)

for i in tqdm(range(epochs)):
    x = listIn[i]
    y = listOut[i].reshape([10, 1])

    outData = model(x.float())
    
    optimizer.zero_grad()

    loss = criterionBCE(outData, y.float())

    loss.backward()

    optimizer.step()

    if (i % 1 == 0):
        yList.append(loss.item())
        xList.append(i)

print(model[0].weight)

The way I made it is just by blindly following guides and seeing what other people do, but I don’t know how it works or if I am even using it the intended way.

Also, going back to a previous point about the output layer of the model: should I keep the output layer as nn.Linear(60, 2) or make it (60, 1)? And a follow-up question: if I keep it (60, 2), should I make my true values reflect that? A thought I had: since I want the nn to identify whether the event is a signal or a background, make the true data have two columns, one for whether it is a signal event and one for whether it is a background event, and set the corresponding column value to 1 and the other to 0?

Another question about how I am splitting up the training data: should I be feeding it to the model in these “batches” of 10? Is this normal, or is there some better way I should be doing this?