Training loss is decreasing while validation loss is NaN

Hi all, I’m training a neural network that combines a CNN and an RNN, and I found that although the training loss is consistently decreasing, the validation loss remains NaN.

Here is an example:

Any idea why?

```python
models = [model1, model2, model3]
for epoch in range(epochs):
    [model.train() for model in models]
    for i, (data, label) in enumerate(dataloader_train):
        data = Variable(data).cuda().float()
        label = torch.squeeze(label)
        label = Variable(label).cuda().float()
        # CNN + RNN
        batch_size, timesteps, C, H, W = data.size()
        c_in = data.view(batch_size * timesteps, C, H, W)
        c_out = model1(c_in)
        _, C, H, W = c_out.size()
        c_out = c_out.view(batch_size, timesteps, C, H, W)
        r_out = model2(c_out)
        pred = model3(r_out)
        pred = torch.sigmoid(pred)
        # training loss update
        loss = criterion(pred, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_losses.append(loss.item())
        if (i + 1) % 10 == 0:
            print('Epoch:', epoch, 'Iter', i, 'Loss:', loss.item())

    [model.eval() for model in models]
    with torch.no_grad():
        for i, (data, label) in enumerate(dataloader_valid):
            data = Variable(data).cuda().float()
            label = torch.squeeze(label)
            label = Variable(label).cuda().float()
            # CNN + RNN
            batch_size, timesteps, C, H, W = data.size()
            c_in = data.view(batch_size * timesteps, C, H, W)
            c_out = model1(c_in)
            _, C, H, W = c_out.size()
            c_out = c_out.view(batch_size, timesteps, C, H, W)
            r_out = model2(c_out)
            pred = model3(r_out)
            pred = torch.sigmoid(pred)
            # validation loss update
            loss = criterion(pred, label)
            valid_losses.append(loss.item())
```

Here is a snippet of the training and validation loops. I’m using a combined CNN+RNN network; model1, model2, and model3 are the encoder, RNN, and decoder, respectively.

Could you check that you are not introducing NaNs in the input?
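
For example, something like this would flag any NaN or Inf values coming out of the loaders (a rough sketch; replace `dataloader_train` / `dataloader_valid` with your actual loaders):

```python
import torch

# Rough check for bad values coming out of the data loaders.
for name, loader in [("train", dataloader_train), ("valid", dataloader_valid)]:
    for i, (data, label) in enumerate(loader):
        if torch.isnan(data).any() or torch.isinf(data).any():
            print(f"{name} batch {i}: bad values in data")
        if torch.isnan(label.float()).any() or torch.isinf(label.float()).any():
            print(f"{name} batch {i}: bad values in label")
```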

Actually, I randomly split the data into training and validation sets, so I don’t think the problem is in the input, since the training loss is decreasing.

Can you post a code snippet of the training and validation?

Hi, I’ve posted a snippet; please take a look.

The code looks good.

Since the code for training and validation is exactly the same, the bug must have been introduced during data preprocessing or splitting. Do you keep a fixed random state while splitting your data? Can you show the code for how you split the data and build the train and validation loaders?
```python
import glob
import os
import pickle

from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader

input_list = sorted(glob.glob(os.path.join(dataset_dir, "input/*.npy")))
output_list = sorted(glob.glob(os.path.join(dataset_dir, "output/*.npy")))

X_train, X_valid, y_train, y_valid = train_test_split(input_list, output_list, test_size=0.3, random_state=42)
scaler = pickle.load(open(scaler_path, 'rb'))

train_dataset = RFFullDataset3d(X_train, y_train, scaler)
valid_dataset = RFFullDataset3d(X_valid, y_valid, scaler)

dataloader_train = DataLoader(train_dataset, 32, shuffle=True)
dataloader_valid = DataLoader(valid_dataset, 32, shuffle=True)
```

Here is how I split and load the train/valid data. I tried different random states and they all have the same issue.

Try feeding random inputs as the validation data, and then all zeros, and see if you get NaN as the loss in both cases. This will help localize the error.
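
For example (a rough sketch; model1/model2/model3 are the models from your snippet, and the shape values are placeholders for your real input sizes):

```python
import torch

# Rough sketch: run the same forward pass on synthetic inputs to see whether
# the NaN comes from the data or from the models themselves.
batch_size, timesteps, C, H, W = 4, 8, 1, 64, 64  # placeholders

[m.eval() for m in [model1, model2, model3]]
with torch.no_grad():
    for name, data in [("random", torch.randn(batch_size, timesteps, C, H, W)),
                       ("zeros", torch.zeros(batch_size, timesteps, C, H, W))]:
        data = data.cuda().float()
        c_in = data.view(batch_size * timesteps, C, H, W)
        c_out = model1(c_in)
        _, C2, H2, W2 = c_out.size()
        r_out = model2(c_out.view(batch_size, timesteps, C2, H2, W2))
        pred = torch.sigmoid(model3(r_out))
        print(name, "-> any NaN in prediction:", torch.isnan(pred).any().item())
```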

Thank you very much for your advice, I will check it.

Do let me know what the error is, if you find it.

Sure, of course. :grinning:

I tried different data patterns and the problem didn’t occur anymore. I also tried clipping the gradients in the RNN and found that it worked. So I guess it is probably an exploding-gradient problem, but I couldn’t figure out why only the validation loss is NaN…
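
For reference, the clipping I added looks roughly like this (placed between backward() and step(); the max_norm value was chosen arbitrarily):

```python
from torch.nn.utils import clip_grad_norm_

# Inside the training loop:
loss.backward()
clip_grad_norm_(model2.parameters(), max_norm=1.0)  # model2 is the RNN; max_norm picked arbitrarily
optimizer.step()
```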

There are no gradients during validation… It’s better to find which sample produces the NaN and check why, rather than randomly changing stuff.
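
For example, something like this would report the first offending sample (a sketch; it reuses the dataset and model names from your earlier snippets):

```python
import torch
from torch.utils.data import DataLoader

# Sketch: go through the validation set one sample at a time and report the
# first index whose input or prediction contains a NaN.
debug_loader = DataLoader(valid_dataset, batch_size=1, shuffle=False)
with torch.no_grad():
    for idx, (data, label) in enumerate(debug_loader):
        data = data.cuda().float()
        batch_size, timesteps, C, H, W = data.size()
        c_out = model1(data.view(batch_size * timesteps, C, H, W))
        _, C2, H2, W2 = c_out.size()
        r_out = model2(c_out.view(batch_size, timesteps, C2, H2, W2))
        pred = torch.sigmoid(model3(r_out))
        if torch.isnan(data).any() or torch.isnan(pred).any():
            print("First bad validation sample index:", idx)
            break
```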

I would have to agree with Juan. There are no gradients during validation. I still feel it has something to do with the specific test data you put in.

I faced the same issue with an LSTM while working with text input. It turned out my sequence length was too long and the input was very small, so it was actually skipping the validation calculation steps.


Hi, for me, test inputs of all zeros also give NaN output. What does this signify then? I am having the same problem with CNN + RNN: only the validation loss is NaN, not the training loss, and training proceeds normally. Even with the LSTM layer removed I get NaN as output.

Similar issue for me lol; setting a batch size that is too large may lead to the same problem.