I have a weird, persistent problem: everything works fine when training on a single GPU, but when I move to multi-GPU training, individual pixels in my input images become NaNs, which of course crashes the training. It happens with random images at random pixels, and there is nothing wrong with the images themselves, since I check for NaNs in my Dataset.__call__ function. Then, during training_step, the NaNs magically appear. So maybe the problem is somewhere inside the collate_fn?
Has anyone encountered anything similar?
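One way to narrow this down is to check for NaNs immediately after collation, so you can tell whether the corruption happens in the `collate_fn` itself or later (e.g., during worker-to-main-process transfer or host-to-device copy). Below is a minimal sketch of such a wrapper, assuming a dataset that yields `(image_tensor, label)` pairs; the name `nan_checking_collate` and the manual stacking are my own, so swap in your actual collate logic:

```python
import torch

def nan_checking_collate(batch):
    """Collate (image, label) pairs and fail loudly if any NaN pixels appear.

    Hypothetical debugging wrapper: stacks tensors manually so the check
    runs right where collation happens inside the DataLoader worker.
    """
    images = torch.stack([item[0] for item in batch])
    labels = torch.tensor([item[1] for item in batch])
    nan_mask = torch.isnan(images)
    if nan_mask.any():
        # Report which samples in the batch contain NaN pixels.
        bad = nan_mask.flatten(1).any(dim=1).nonzero().flatten().tolist()
        raise RuntimeError(f"NaN pixels after collation at batch indices {bad}")
    return images, labels
```

Pass it via `DataLoader(dataset, collate_fn=nan_checking_collate, ...)`. If this check never fires but NaNs still show up in `training_step`, the corruption is happening after collation, which would point toward shared-memory tensor transfer between workers, `pin_memory`, or the GPU copy rather than the collate function.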