DataLoader produces NaNs with DDP

I have a weird and persistent problem: everything works fine when training on one GPU, but when I move to multi-GPU training, individual pixels in my input images become NaNs, which of course crashes the training. It happens with random images, and there is nothing wrong with the images themselves, as I check for NaNs in my Dataset's __call__ function. Then during training_step the NaNs magically appear, at random pixels in random inputs. So could the problem be inside the collate_fn?
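To narrow down where the NaNs first appear, one option is to wrap the default collation with an explicit NaN check, so a failure raises inside the worker that produced the bad batch rather than later in training_step. This is a minimal sketch (the function name `nan_checking_collate` and the toy dataset are my own, not from the original post):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, default_collate


def nan_checking_collate(batch):
    """Collate as usual, then fail fast if any tensor contains NaNs."""
    out = default_collate(batch)
    items = out if isinstance(out, (list, tuple)) else [out]
    for i, t in enumerate(items):
        if torch.is_tensor(t) and torch.isnan(t).any():
            raise RuntimeError(f"NaN detected in collated tensor {i}")
    return out


# Hypothetical usage: a clean toy dataset should pass the check untouched.
ds = TensorDataset(torch.randn(8, 3, 4, 4))
loader = DataLoader(ds, batch_size=4, collate_fn=nan_checking_collate)
for (x,) in loader:
    assert not torch.isnan(x).any()
```

If the check trips here but the Dataset-side check passed, the corruption happens between `__getitem__` and collation (e.g. in worker shared-memory transfer), which would point away from the model itself.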

Has anyone encountered anything similar?

@VitalyFedyunin for dataloader

This seems to be the same problem as 120733.