Infinite or nan batch loss encountered when shuffling the training data

Problem:

Hello everyone, I’m working from the code of the transfer_learning_tutorial, swapping in my own dataset to fine-tune ResNet18.

I’ve encountered a situation where the batch loss quickly turns to inf (at the 2nd or 3rd batch) and then to nan if I shuffle the training data in my data_loader.

Solution tried:

However, if I set shuffle=False in the DataLoader, the problem is still there but appears much later and at random (the batch where the loss turns to nan is different each time I run the code). Sometimes a whole epoch even finishes completely, so I suppose it might not be a problem with my dataset.

I use an initial learning rate of 1e-5 with learning rate decay every 7 epochs.
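For reference, this is roughly how the optimizer and schedule are set up (a minimal sketch following the tutorial's StepLR pattern; `model` here stands for the ResNet18 being fine-tuned and the momentum value is illustrative):

    import torch.optim as optim
    from torch.optim import lr_scheduler

    # SGD with the initial learning rate mentioned above (momentum value is illustrative)
    optimizer = optim.SGD(model.parameters(), lr=1e-5, momentum=0.9)
    # decay the learning rate by a factor of 0.1 every 7 epochs, as in the tutorial
    scheduler = lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)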

## Description of code
My images and labels are saved in a large HDF5 file (uint8 for the images and float32 for the labels). I wrote my own Dataset class and apply some preprocessing to the data there. I use MSELoss as the loss function, and all other parts are quite similar to the tutorial.


    import numpy as np
    import torch
    import torch.utils.data

    class MyDataset(torch.utils.data.Dataset):
        def __init__(self, image_h5, label_h5):
            assert image_h5.shape[0] == label_h5.shape[0]
            self.image_h5 = image_h5
            self.label_h5 = label_h5

        def __getitem__(self, index):
            # subtract the mean image (MEAN_MATRIX, defined elsewhere), scale to [0, 1]
            # and convert to float32
            image = ((self.image_h5[index] - MEAN_MATRIX) / 255).astype(np.float32)
            return torch.from_numpy(image), torch.from_numpy(self.label_h5[index])

        def __len__(self):
            return self.image_h5.shape[0]

    # train_file / test_file are the h5py file handles opened earlier
    train_set = MyDataset(train_file['image_set'], train_file['label_set'])
    test_set = MyDataset(test_file['image_set'], test_file['label_set'])
    dset_loaders = {
        'train': torch.utils.data.DataLoader(train_set, batch_size=opt.batch_size,
                                             shuffle=True, num_workers=opt.threads),
        'val': torch.utils.data.DataLoader(test_set, batch_size=opt.batch_size,
                                           shuffle=False, num_workers=opt.threads)
    }

I would like to know the probable cause of this problem and whether I’ve written my Dataset/DataLoader correctly.

Thank you very much

one of your images is either loading with NaNs, or is all zeros (maybe generating NaNs in batchnorm because of a divide by zero)
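A quick way to rule that out is a one-off scan of the raw HDF5 arrays, for example (a minimal sketch reusing the `train_file` handle from the first post; since the images are uint8, only the float32 labels can actually hold NaN/inf):

    import numpy as np

    # one-pass sanity check over the raw HDF5 arrays
    images = train_file['image_set']   # uint8 images
    labels = train_file['label_set']   # float32 labels
    for i in range(images.shape[0]):
        if not images[i].any():                   # every pixel is zero
            print('all-zero image at index', i)
        if not np.isfinite(labels[i]).all():      # NaN or inf in the target
            print('non-finite label at index', i)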

Thank you @smth. But since I can sometimes finish a whole epoch with shuffle=False, I think my dataset is fine.

I’m suspecting that my dataset does not converge with a block structure like ResNet. As a result I’m trying the more classic VGG architecture.

However, when I try the VGG16 or VGG13 architecture, I get the following error:

> Traceback (most recent call last):
>   File "pytorch_finetuning.py", line 171, in <module>
>     num_ftrs = model_conv.fc8.in_features
>   File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 237, in __getattr__
>     return object.__getattr__(self, name)
> AttributeError: type object 'object' has no attribute '__getattr__'

I would like to confirm whether the VGG models support a call like model.fc.in_features or model.fc8.in_features.
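For what it's worth, torchvision's VGG models don't expose an `fc` or `fc8` attribute; their fully connected layers live in a `Sequential` module called `classifier`, so the equivalent of `model.fc.in_features` looks roughly like this (a sketch against torchvision's `vgg16`; `NUM_OUTPUTS` is a placeholder for the label dimension):

    import torch.nn as nn
    import torchvision.models as models

    model_conv = models.vgg16(pretrained=True)
    # the last entry of `classifier` is the final Linear layer (the VGG "fc8")
    num_ftrs = model_conv.classifier[6].in_features
    # replace it for fine-tuning on NUM_OUTPUTS targets
    model_conv.classifier[6] = nn.Linear(num_ftrs, NUM_OUTPUTS)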

@smth I’ve continued working on my code, trying to solve the nan/inf batch loss problem. I found that the problem is related to the num_workers parameter of torch.utils.data.DataLoader.

I have an Intel i7-6700HQ CPU with 4 cores and 8 threads. If I give num_workers a larger value, e.g. 6 or 7, inf and nan batch losses are more likely; in contrast, with a value smaller than 4 the inf or nan never appears. I’m wondering whether a larger value of num_workers creates some conflict.

On the other hand, I tried num_workers=4 and num_workers=1 but found almost no speed-up. According to my system resource monitor there seems to be no multi-processing at all.

I would like to know if I am using this parameter correctly. My code is in my first post. Thanks.
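(Note for later readers: one frequently cited source of exactly this symptom is sharing a single h5py file handle across DataLoader worker processes; HDF5 is not fork-safe, so worker reads can silently return corrupted data. A minimal sketch of the usual workaround, opening the file lazily inside each worker, is below; the class name and file path are placeholders, the dataset keys follow the first post, and preprocessing is omitted.)

    import h5py
    import torch
    import torch.utils.data

    class LazyH5Dataset(torch.utils.data.Dataset):
        """Opens the HDF5 file inside each worker instead of sharing one handle."""
        def __init__(self, h5_path):
            self.h5_path = h5_path
            self.h5_file = None
            # read the length once, then close; keep no open handle in the parent process
            with h5py.File(h5_path, 'r') as f:
                self.length = f['image_set'].shape[0]

        def __getitem__(self, index):
            if self.h5_file is None:   # first access in this worker: open the file
                self.h5_file = h5py.File(self.h5_path, 'r')
            image = self.h5_file['image_set'][index]    # preprocessing omitted for brevity
            label = self.h5_file['label_set'][index]
            return torch.from_numpy(image), torch.from_numpy(label)

        def __len__(self):
            return self.length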

Hello @Yozey!

I know this is late, as in almost a year late, but I faced a similar problem with DataLoader's num_workers parameter. For you or anyone else reading this, please see my post on this forum about it, and also read my response below.

For me the problem was that some of the images were of inconsistent size, but this was only revealed when shuffling the dataset and even then just by chance.
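In case it helps anyone, a one-off check like the following (a sketch against the `MyDataset` instance from the first post) surfaces size inconsistencies without waiting for an unlucky shuffle:

    # walk the dataset once and report any sample whose shape differs from the first one
    expected = None
    for i in range(len(train_set)):
        image, _ = train_set[i]
        if expected is None:
            expected = image.shape
        elif image.shape != expected:
            print('sample', i, 'has shape', tuple(image.shape), 'expected', tuple(expected))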