Model predictions in eval mode are all nans

I’m currently working on a multi-label classification problem using a dataset containing images of Ocular Disease. I’ve written a custom PyTorch Dataset implementation to create a loader that will, for every patient ID, return the left and the right image, as well as the associated 8D multi-hot target vector.

The following is my Dataset class code:

class CustomDataset(
    def __init__(self, IDs, data_dir, annotations, transforms=None):
        self.transforms = transforms
        self.annotations = annotations
        self.data_dir = data_dir
        # read the excel file.
        mapper = pd.read_excel(self.annotations, sheet_name=0)
        mapper.ID = mapper.ID.astype(str)
        # 'ID' is the primary key.
        self.subset_df = mapper.loc[mapper['ID'].isin(IDs), :]
        print("Found {} entries.".format(self.subset_df.shape[0]))
        self.subset_df.reset_index(drop=True, inplace=True)
        self.ix2id = {
            ix:id for ix, id in zip(list(self.subset_df.index), list(self.subset_df.ID))

    def __len__(self):
        return len(self.ix2id)

    def __getitem__(self, index):
        # Sample from IDs
        sampled_id = self.ix2id[index]
        row = self.subset_df.loc[self.subset_df.ID == sampled_id, :]
        left_im_name = row['Left-Fundus'].values[0]
        right_im_name = row['Right-Fundus'].values[0]
        left_im = + left_im_name, "r")
        right_im = + right_im_name, "r")
        if self.transforms is not None:
            left_im = self.transforms(left_im)
            right_im = self.transforms(right_im)
        y = row[['N','D','G','C','A','H','M','O']].values.flatten()
        return (left_im, right_im), y

The class is instantiated with a list of IDs (split separately). The following is the instantiation:

train_transforms = transforms.Compose([
    transforms.Resize((IM_HEIGHT, IM_WIDTH)),
    transforms.Normalize(mean=mean, std=std)

val_transforms = transforms.Compose([
    transforms.Resize((IM_HEIGHT, IM_WIDTH)),
    transforms.Normalize(mean=mean, std=std)

train = CustomDataset(train_split, data_dir='path_to_im', annotations='somepath', transforms=train_transforms)

test = CustomDataset(test_split, data_dir='path_to_im', annotations='somepath', transforms=val_transforms)

train_dataloader =, batch_size=BATCH_SIZE, shuffle=True, num_workers=5)

test_dataloader =, batch_size=BATCH_SIZE, shuffle=True, num_workers=5)

The model training is without a hitch. The evaluation, however, is where the model returns nans. The following is the eval() code:

def eval(epoch_num):
    epoch_loss = 0.
    y_true = []
    y_pred = []
    for ix, ((left_im, right_im), target) in enumerate(test_dataloader):
        # Push on GPU.
        left_im =
        right_im =
        target =
        with torch.no_grad():
            left_im, right_im, target = Variable(left_im), Variable(right_im), Variable(target)
        # Pass both image batches through.
        preds = model(left_im, right_im)
        # !!! Preds is a (B, 8) tensor full of nans !!!

        loss = criterion(preds, target)
        epoch_loss += loss.item()
        pred_cpu = preds.detach().cpu().numpy()
    y_true = np.concatenate(y_true, axis=0)
    y_pred = np.concatenate(y_pred, axis=0)
    print("Eval loss: {}".format(epoch_loss/len(test_dataloader)))
    metrics = custom_Metrics(gt_data=y_true, pr_data=y_pred)
    print("Eval metrics: {}".format(metrics))
    metrics['loss'] = epoch_loss
    return metrics

What seems to be wrong here?

The model is a simple CNN that passes both images through a ‘backbone’ (for shared parameters), concatenates the latent vectors, and passes it through a couple of Linear layers. I’m using the BCEWithLogitsLoss criterion and an Adam optimizer and an initial learning rate of 5e-4.

I am also seeing nan in my Transformer in eval mode!! @Rahul_Kulhalli Did you happen to figure out why?

@Rahul_Kulhalli could you share the image size of the same.
I have also encountered nan issues while training for maskrcnn but it could be fixed by
proper annotations . As the task you are performing is a supervised task could you recheck the target label and if possible post the loss curve obtained while training. I suspect you could see it shooting up before returning nan