Outputting 'nan' as predictions in my neural network training loop

In my training loop, the predictions from my neural network were coming out as ‘nan’. To work around this I tried downgrading my PyTorch install from the CUDA 11.8 build to the CUDA 11.7 build, but that only changed the device in use from CPU to GPU. I don’t know if this is a bug with PyTorch or if my code is just not working. Any advice would help.

import torch
import numpy as np
from torch import nn
from torch import optim
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader
import pandas as pd

# Get data from tables
# Split data up into training and testing (80% Train, 20% Test)
pnt_cont_data = pd.read_excel(
    'C:\\Users\\bamar\\Downloads\\Chile_Research\\Research_Resources\\DoE_point_contact.xlsm',
    sheet_name='Sample',
    index_col=0,
    names=['#', 'E1', 'E2', 'v1', 'v2', 'Ap', 'rho0', 'mue0', 'u1', 'u2',
           'R', 'Fn', 'hmin', 'hc', 'p'],
)
pnt_cont_data.drop(index=pnt_cont_data.index[0], inplace=True)
del pnt_cont_data['p']

# Turning data into numpy arrays
X = pnt_cont_data.to_numpy()[:,:-3]
split = int(0.8*len(X))
X_train_np, X_test_np = X[:split], X[split:]

y = pnt_cont_data.to_numpy()[:,11:]
y_train_np, y_test_np = y[:split], y[split:]

X_train_float = X_train_np.astype(np.float32)
X_train = torch.Tensor(X_train_float)


X_test_float = X_test_np.astype(np.float32)
X_test = torch.Tensor(X_test_float)

y_train_float = y_train_np.astype(np.float32)
y_train = torch.Tensor(y_train_float)

y_test_float = y_test_np.astype(np.float32)
y_test = torch.Tensor(y_test_float)

# Creating a train dataset and dataloader with batch_size = 40
train_dataset = TensorDataset(X_train, y_train)
train_dataloader = DataLoader(train_dataset, batch_size=40)


# Creating a test dataset and dataloader with batch_size = 10
test_dataset = TensorDataset(X_test, y_test)
test_dataloader = DataLoader(test_dataset, batch_size=10)


# Building Neural Network
#device = 'cuda' if torch.cuda.is_available() else 'cpu'
device = torch.device('cuda')

input_dim = 10
output_dim = 2

class NN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(NN, self).__init__()
        self.linear = nn.Linear(input_dim, output_dim)
        
        '''
        Add hidden layers here:
        '''
        
    def forward(self, x):
        y = self.linear(x)
        
        '''
        Add activation functions for hidden layers here:
        '''
        return y

model = NN(input_dim, output_dim).to(device)

lr = 0.1
optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.MSELoss()


LOSS = []

epochs = 100

for epoch in range(epochs):
    for X, y in train_dataloader:
        X, y = X.to(device), y.to(device)
        
        #X and y are printing as the data given
        #y_pred is printing as nan
        y_pred = model(X)
        
        
        loss = criterion(y_pred, y)
        LOSS.append(loss.item())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Hi Brooklyn!

Check that your input data is free of nans.

Try a single forward pass. Are the outputs of your network free of nans?

Try a single backward pass. Are your gradients free of nans?

Try a single optimization step. Are your model weights free of nans?
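
For concreteness, here is a minimal sketch of those four checks, reusing
the model, criterion, optimizer, train_dataloader, and device from your
post (torch.isnan flags nans elementwise):

# Pull a single batch from the training dataloader
X, y = next(iter(train_dataloader))
X, y = X.to(device), y.to(device)

# 1. Input data: any nans in the batch?
print('nan in X:', torch.isnan(X).any().item())
print('nan in y:', torch.isnan(y).any().item())

# 2. Single forward pass: any nans in the predictions?
y_pred = model(X)
print('nan in y_pred:', torch.isnan(y_pred).any().item())

# 3. Single backward pass: any nans in the gradients?
loss = criterion(y_pred, y)
optimizer.zero_grad()
loss.backward()
for name, param in model.named_parameters():
    print(f'nan in grad of {name}:', torch.isnan(param.grad).any().item())

# 4. Single optimization step: any nans in the updated weights?
optimizer.step()
for name, param in model.named_parameters():
    print(f'nan in {name}:', torch.isnan(param).any().item())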

If you pass these tests, your training is probably becoming unstable
over time.

Try training with plain-vanilla SGD (no momentum, no weight decay).
Start with a low learning rate. Can you train successfully, even if slowly?
If so, try increasing the learning rate and possibly turning on momentum.
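
As a sketch, reusing the model from your post (the lr value here is just
a starting guess, not a recommendation):

# plain-vanilla SGD: momentum and weight_decay default to 0
optimizer = optim.SGD(model.parameters(), lr=1e-4)

# if this trains stably, try raising lr and turning on momentum:
# optimizer = optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)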

Although it often trains faster, Adam can sometimes be unstable.

Also, as a general rule, the learning rate at which Adam trains stably
tends to be significantly smaller than one that works with SGD. I would
suggest starting with lr = 1.e-6 and increasing it until you get either
successful, if slow, training or unstable training.
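
In code, that starting point would be:

optimizer = optim.Adam(model.parameters(), lr=1e-6)  # increase gradually from here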

Good luck!

K. Frank

Thank you so much for the response! I took out any rows in my data that contained ‘nan’, changed Adam to SGD, and lowered my learning rate, and everything seems to work now! I appreciate the help!