NaN validation loss with batch size 1

ksmdanl · February 21, 2022, 3:01pm

I implemented SSD with a resnet18 backbone. The model trains, but I get a NaN loss during validation on some images.
After further exploration I found degenerate boxes, so I added a filter that accepts only non-degenerate boxes.
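
For reference, a filter of this kind can be sketched as follows. This is only a minimal illustration, assuming boxes are stored as `[xmin, ymin, xmax, ymax]` lists; the helper names `is_valid_box` and `filter_degenerate` are hypothetical and not taken from any library.

```python
import math

def is_valid_box(box, min_size=1e-6):
    """Return True if the box has finite coordinates and positive width and height."""
    xmin, ymin, xmax, ymax = box
    # Reject NaN or infinite coordinates outright.
    if not all(math.isfinite(v) for v in box):
        return False
    # A degenerate box has zero or negative extent along either axis.
    return (xmax - xmin) > min_size and (ymax - ymin) > min_size

def filter_degenerate(boxes, labels):
    """Keep only valid boxes, together with their matching labels."""
    keep = [i for i, b in enumerate(boxes) if is_valid_box(b)]
    return [boxes[i] for i in keep], [labels[i] for i in keep]
```

Note that an image can end up with an empty box list after such a filter, and the loss function then has to handle that case explicitly.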
This doesn't fix the issue. What happens is that on each of these images no bounding boxes are detected, so a zero array and the ground truth are passed to the loss function, which returns NaN.
Looking at the images where the NaNs occur, I see no anomaly in how the data is preprocessed, for either the images or the labels; i.e., the distribution after preprocessing looks reasonable, at least to me.
Could this trace back to the batch size I use, which is 1?
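
One way batch size 1 could matter, offered here as an assumption rather than a diagnosis: many SSD loss implementations divide the summed loss by the number of matched (positive) default boxes in the batch. With a single image that has no matches, that count is 0, and in floating-point tensor arithmetic 0/0 evaluates to NaN. Below is a minimal pure-Python sketch of a guarded normalization plus a finiteness check; `normalize_loss`, `check_finite`, and their arguments are hypothetical names, not part of my code or of any library.

```python
import math

def normalize_loss(loc_loss_sum, conf_loss_sum, num_pos):
    """Divide the summed losses by the positive count, clamped to at least 1."""
    # Clamping avoids a 0/0 division when no default box was matched.
    denom = max(float(num_pos), 1.0)
    return (loc_loss_sum + conf_loss_sum) / denom

def check_finite(loss, image_id):
    """Raise early, naming the image, if the loss is NaN or infinite."""
    if not math.isfinite(loss):
        raise ValueError(f"non-finite loss {loss!r} for image {image_id}")
    return loss
```

With larger batches the positive count is summed over several images, so a single image without matches would be less likely to drive the denominator to zero.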
Looking forward to the exchange!