Training loss is decreasing while validation loss is NaN

fan_percy · June 18, 2019, 12:42am

Hi all, I’m training a neural network with both a CNN and an RNN, but I found that although the training loss is consistently decreasing, the validation loss remains NaN.

Here is an example:

Any idea why?

> models = [model1, model2, model3]
> for epoch in range(epochs):
>     for model in models:
>         model.train()
>     for i, (data, label) in enumerate(dataloader_train):
>         data = data.cuda().float()
>         label = torch.squeeze(label).cuda().float()
>         # CNN + RNN: fold the time axis into the batch axis for the encoder
>         batch_size, timesteps, C, H, W = data.size()
>         c_in = data.view(batch_size * timesteps, C, H, W)
>         c_out = model1(c_in)
>         _, C, H, W = c_out.size()
>         c_out = c_out.view(batch_size, timesteps, C, H, W)
>         r_out = model2(c_out)
>         pred = model3(r_out)
>         pred = torch.sigmoid(pred)
>         # training loss update
>         loss = criterion(pred, label)
>         optimizer.zero_grad()
>         loss.backward()
>         optimizer.step()
>         train_losses.append(loss.item())
>         if (i + 1) % 10 == 0:
>             print('Epoch:', epoch, 'Iter', i, 'Loss:', loss.item())

>     for model in models:
>         model.eval()
>     with torch.no_grad():
>         for i, (data, label) in enumerate(dataloader_valid):
>             data = data.cuda().float()
>             label = torch.squeeze(label).cuda().float()
>             # CNN + RNN: same forward pass as in training
>             batch_size, timesteps, C, H, W = data.size()
>             c_in = data.view(batch_size * timesteps, C, H, W)
>             c_out = model1(c_in)
>             _, C, H, W = c_out.size()
>             c_out = c_out.view(batch_size, timesteps, C, H, W)
>             r_out = model2(c_out)
>             pred = model3(r_out)
>             pred = torch.sigmoid(pred)
>             # validation loss update
>             loss = criterion(pred, label)
>             valid_losses.append(loss.item())

Here is a snippet of the training and validation loops. I’m using a combined CNN+RNN network; models 1, 2 and 3 are the encoder, RNN and decoder, respectively.

JuanFMontesinos · June 18, 2019, 3:18pm

Could you check that you are not feeding NaNs in as input?
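A minimal sketch of such a check, assuming each sample can be loaded as a NumPy array. The helper names `count_nonfinite` and `report_nonfinite` are made up for illustration and are not part of the poster's code.

```python
import numpy as np


def count_nonfinite(array):
    """Return how many entries of the array are NaN or +/-Inf."""
    return int((~np.isfinite(array)).sum())


def report_nonfinite(named_arrays):
    """Map each name to its NaN/Inf count, keeping only arrays that have any.

    `named_arrays` is an iterable of (name, array-like) pairs, for example
    (file path, np.load(file path)).
    """
    bad = {}
    for name, array in named_arrays:
        n = count_nonfinite(np.asarray(array, dtype=np.float64))
        if n:
            bad[name] = n
    return bad
```

Running this over both the inputs and the labels of the validation split would show whether any file carries a NaN or Inf before the model ever sees it.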

fan_percy · June 19, 2019, 4:30am

Actually, I randomly split the data into training and validation sets, so I don’t think the problem is with the input, since the training loss is decreasing.

charan_Vjy · June 19, 2019, 7:11am

Can you post a code snippet of the Training and Validation?

fan_percy · June 19, 2019, 7:53am

Hi, I’ve posted a snippet, please take a look.

charan_Vjy · June 19, 2019, 8:38am

The code looks good.

Since the code for train and test is exactly the same, the bug must have been introduced during data preprocessing or splitting. Do you keep a fixed random state while splitting your data? Can you show the code for how you split the data and build the trainloader and testloader?

fan_percy · June 19, 2019, 11:50am

import glob
import os
import pickle

from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader

# Sorted so that input and output files pair up (assuming matching file names)
input_list = sorted(glob.glob(os.path.join(dataset_dir, "input/*.npy")))
output_list = sorted(glob.glob(os.path.join(dataset_dir, "output/*.npy")))

X_train, X_valid, y_train, y_valid = train_test_split(input_list, output_list, test_size=0.3, random_state=42)
with open(scaler_path, 'rb') as f:
    scaler = pickle.load(f)

train_dataset = RFFullDataset3d(X_train, y_train, scaler)
valid_dataset = RFFullDataset3d(X_valid, y_valid, scaler)

dataloader_train = DataLoader(train_dataset, 32, shuffle=True)
dataloader_valid = DataLoader(valid_dataset, 32, shuffle=True)

Here is how I split and load the train/valid data. I tried different random states and they all have the same issue.

charan_Vjy · June 19, 2019, 11:54am

Feed the model random values as test inputs, then all zeros, and see whether you get NaN as the loss in both cases. This will help localize the error.
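A rough sketch of that experiment, assuming the whole network can be called as a single module. The function name `probe_with_synthetic_inputs` is invented here, and it checks the output rather than the loss, so no labels are needed.

```python
import torch
from torch import nn


def probe_with_synthetic_inputs(model, input_shape):
    """Run the model on random and all-zero inputs in eval mode.

    Returns {"random": bool, "zeros": bool}, where True means every output
    value was finite (no NaN or Inf).
    """
    model.eval()
    probes = (
        ("random", torch.randn(*input_shape)),
        ("zeros", torch.zeros(*input_shape)),
    )
    results = {}
    with torch.no_grad():
        for name, x in probes:
            out = model(x)
            results[name] = bool(torch.isfinite(out).all())
    return results
```

For the network in this thread the shape would be something like `(batch_size, timesteps, C, H, W)`. If both probes come back non-finite, the NaN is produced by the model in eval mode regardless of the data; if both are finite, the validation samples themselves become the main suspect.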

fan_percy · June 19, 2019, 11:57am

Thank you very much for your advice, I will check it.

charan_Vjy · June 19, 2019, 12:42pm

Do let me know what the error is, if you find it.

fan_percy · June 19, 2019, 12:59pm

Sure, of course.

fan_percy · June 21, 2019, 2:41pm

I tried different data patterns and the problem no longer occurred. I also tried clipping the gradients in the RNN and found that it worked. So I guess it is probably an exploding gradient problem, but I couldn’t figure out why only the validation loss is NaN…
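The post does not show how the clipping was done. One common way, sketched here as an assumption rather than the poster's actual code, is `torch.nn.utils.clip_grad_norm_` between `backward()` and `step()`. The function name `clipped_train_step` and the `max_norm=1.0` default are illustrative.

```python
import torch
from torch import nn


def clipped_train_step(model, criterion, optimizer, data, label, max_norm=1.0):
    """One optimization step with gradient-norm clipping.

    Returns the loss value and the total gradient norm measured before
    clipping, which is useful for spotting exploding gradients in the logs.
    """
    optimizer.zero_grad()
    loss = criterion(model(data), label)
    loss.backward()
    # Rescale all gradients in place so their combined L2 norm is <= max_norm
    norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item(), float(norm_before)
```

Logging the returned norm for a few epochs shows whether the gradients were actually large; to clip only the RNN, pass that sub-module's parameters instead of the whole model's.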

JuanFMontesinos · June 21, 2019, 3:17pm

There are no gradients during validation… It’s better to find which sample produces the NaN and check why, rather than randomly changing things.
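A small sketch of how the offending samples could be located, assuming the forward pass is wrapped in one callable and the samples are visited one at a time. `find_nonfinite_samples` and `predict` are hypothetical names, and the demo data is synthetic.

```python
import math

import torch
from torch import nn


def find_nonfinite_samples(predict, criterion, samples):
    """Return the indices of samples whose loss is NaN or Inf.

    `predict` maps an input tensor to a prediction; `samples` is any
    iterable of (data, label) pairs, e.g. a DataLoader with batch_size=1
    and shuffle=False so that indices stay meaningful.
    """
    bad = []
    with torch.no_grad():
        for idx, (data, label) in enumerate(samples):
            loss = criterion(predict(data), label).item()
            if not math.isfinite(loss):
                bad.append(idx)
    return bad


if __name__ == "__main__":
    demo = [
        (torch.tensor([1.0]), torch.tensor([1.0])),
        (torch.tensor([float("nan")]), torch.tensor([0.0])),
    ]
    # The identity function stands in for the real model here
    print(find_nonfinite_samples(lambda x: x, nn.MSELoss(), demo))
```

Once the indices are known, the corresponding files can be inspected directly instead of guessing at the cause.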

charan_Vjy · June 21, 2019, 3:25pm

Would have to agree with Juan. There are no gradients during validation. I still feel it is something to do with the specific test data that you put in.

SivaHemanth24 · August 9, 2020, 7:21am

I faced the same issue with an LSTM while working with text input. It turned out my sequence length was too long and my input was very small, so the validation calculation steps were actually being skipped.
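When the validation loop never runs, the list of losses stays empty, and averaging it is what yields NaN (`np.mean([])` evaluates to `nan` with a RuntimeWarning). A minimal guard, with the hypothetical name `summarize_validation`, makes that case fail loudly instead:

```python
import math


def summarize_validation(valid_losses):
    """Average per-batch validation losses.

    Raises ValueError when no batch was recorded, so a skipped validation
    loop is reported as such rather than showing up as a NaN average.
    """
    if len(valid_losses) == 0:
        raise ValueError(
            "validation loop produced no batches; "
            "check sequence length, dataset size and batch size"
        )
    return math.fsum(valid_losses) / len(valid_losses)
```

Printing `len(valid_losses)` after each epoch is an even quicker way to tell this case apart from a genuine NaN coming out of the model.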

prat · May 4, 2021, 12:45pm

Hi, for me, test inputs of all zeros also give NaN output. What does this signify? I am having the same problem with a CNN + RNN: only the validation loss is NaN, not the training loss, and training proceeds normally. Even with the LSTM layer removed I get NaN as output.

yfflood · January 14, 2024, 2:24am

Similar issue for me lol. Setting the batch size too large may lead to the same problem.
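The post does not say how the batch size led to NaN. One possible mechanism, offered only as an assumption, is a validation loader built with `drop_last=True` on a split smaller than one batch, which yields zero batches. The sketch below mirrors how `len()` of a PyTorch `DataLoader` over a map-style dataset is computed; `batches_per_epoch` is an illustrative name.

```python
import math


def batches_per_epoch(num_samples, batch_size, drop_last=False):
    """Number of batches a loader yields per epoch.

    With drop_last=True the trailing incomplete batch is discarded, so a
    dataset smaller than batch_size produces no batches at all.
    """
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)
```

For example, 20 validation samples with a batch size of 32 give one batch by default but none with `drop_last=True`, in which case no validation loss is ever recorded.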