Distributed training gives NaN loss but single-GPU training is fine

I ran into the exact same problem.
Did you ever find out what was causing it?

Thanks!
