Hello everyone, I’m working with the code from the transfer_learning_tutorial, swapping in my own dataset to fine-tune ResNet18.
I’ve run into a situation where the batch loss quickly (at the 2nd or 3rd batch) turns to inf and then NaN if I shuffle the training data in my data_loader.
## Solution tried
However, if I set shuffle=False in the DataLoader, the problem is still there but appears much later and at random (the batch where the loss turns to NaN is different each time I run the code). Sometimes an epoch even finishes completely, so I suspect the problem is not my dataset itself.
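To localize the problem, a small guard that stops at the first batch where the loss diverges can help; this is a generic sketch (the function name and structure are my own, not from the actual training code):

```python
import math

def check_loss(loss_value, batch_idx):
    """Raise as soon as the loss stops being finite, reporting the batch.

    loss_value is assumed to be a plain float (e.g. loss.item() in PyTorch).
    """
    if not math.isfinite(loss_value):
        raise RuntimeError(f"Loss became {loss_value} at batch {batch_idx}")
    return loss_value
```

Calling something like `check_loss(loss.item(), i)` right after computing the loss pinpoints the exact batch, which makes it easy to dump that batch's inputs and labels for inspection.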
I use an initial learning rate of 1e-5, decayed every 7 epochs.
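In arithmetic terms, the step-decay schedule looks like this (I'm assuming a decay factor of 0.1, as in the tutorial's StepLR setup; the exact gamma may differ):

```python
def lr_at_epoch(epoch, initial_lr=1e-5, decay_every=7, gamma=0.1):
    """Step decay: the learning rate is multiplied by gamma
    once every decay_every epochs."""
    return initial_lr * gamma ** (epoch // decay_every)
```

With these values the rate starts at 1e-5 and drops to roughly 1e-6 at epoch 7, so the divergence in the very first batches happens at the smallest, most conservative setting, which again points away from the learning rate as the cause.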
## Description of code
My images and labels are saved in a large HDF5 file (uint8 for images and float32 for labels). I rewrote the data loading and apply some preprocessing to my data. I use MSELoss as the loss function, and all the other parts are quite similar to the tutorial.
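The uint8-to-float conversion is one place where a scaling mistake can silently inflate an MSE loss; a minimal sketch of the kind of preprocessing I mean (the scaling is a common convention, not necessarily my exact pipeline):

```python
import numpy as np

def preprocess(image_uint8):
    """Convert an HWC uint8 image to float32 scaled to [0, 1].

    Skipping the / 255.0 step feeds values up to 255 into the network,
    which can blow up an MSE loss within a few batches.
    """
    return image_uint8.astype(np.float32) / 255.0
```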
@smth I’ve continued working on my code, trying to solve the NaN/inf batch-loss problem. I found that it is related to the num_workers parameter of torch.utils.data.DataLoader.
I have an Intel i7-6700HQ CPU with 4 cores and 8 threads. If I use a larger value for num_workers, e.g. 6 or 7, inf and NaN batch losses are more likely; with a value below 4, they never appear. I’m wondering whether a larger num_workers creates some kind of conflict.
On the other hand, I tried num_workers=4 and num_workers=1 and saw almost no speedup, and according to my system resource monitor there seems to be no multi-processing happening at all.
I would like to know if I am using this parameter correctly. My code is in my first post. Thanks.
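A common rule of thumb is to keep num_workers at or below the number of physical cores rather than logical threads; a sketch of how one might pick it (the cap of 4 is my assumption based on this CPU, not a documented PyTorch recommendation):

```python
import os

def pick_num_workers(physical_core_cap=4):
    """Cap DataLoader workers near the physical core count.

    os.cpu_count() reports logical threads (8 on an i7-6700HQ with
    hyper-threading), so an explicit cap at the physical core count
    is applied on top of it.
    """
    cpus = os.cpu_count() or 1
    return min(cpus, physical_core_cap)
```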
I know this is late, almost a year late, but I faced a similar problem with DataLoader's num_workers parameter. For you or anyone else reading this, please see my post in this forum about it, and also read my response below.
For me the problem was that some of the images had inconsistent sizes, which was only revealed when shuffling the dataset, and even then only by chance.
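A quick scan over the dataset catches this kind of inconsistency before training ever starts; a sketch assuming the samples are numpy arrays (the helper name is mine):

```python
import numpy as np

def find_inconsistent_shapes(images):
    """Return the indices of images whose shape differs from the first one."""
    if not images:
        return []
    expected = images[0].shape
    return [i for i, img in enumerate(images) if img.shape != expected]
```

Running a check like this over the whole dataset is cheap compared to debugging a NaN that only appears when shuffling happens to batch mismatched images together.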