RuntimeError: NCCL Error 10: cuda malloc failed

I'm getting this error in PyTorch. I use a parallel model in the training phase, and the problem shows up in testing. How can I solve it?
This situation only occurs when the model is reloaded!

Hm, haven’t seen this before. There’s a related thread here:

But maybe the issue there is due to some other problem. When I save models, I usually move them to the CPU first, because that keeps them most compatible. Then, when I reload them, I can put them on whichever GPUs are currently available.
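A rough sketch of what I mean by saving on the CPU and reloading onto whatever device is around. The `nn.Linear` model and the temp-file path are just placeholders I made up for illustration, not anything from your code:

```python
import os
import tempfile

import torch
import torch.nn as nn

# Hypothetical small model standing in for the real network.
model = nn.Linear(4, 2)

# If the model was wrapped in nn.DataParallel, the real network lives in .module.
net = model.module if isinstance(model, nn.DataParallel) else model

# Copy every tensor to the CPU before saving, so the file is not tied to a GPU id.
cpu_state = {name: tensor.cpu() for name, tensor in net.state_dict().items()}
ckpt_path = os.path.join(tempfile.mkdtemp(), "model_cpu.pt")  # hypothetical path
torch.save(cpu_state, ckpt_path)

# Later: load onto the CPU first, then move to whatever device is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(ckpt_path, map_location="cpu"))
restored.to(device)
```

Passing `map_location="cpu"` to `torch.load` also helps with files that were saved straight from the GPU, since the tensors are remapped instead of being restored to their original device.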

Maybe you didn’t do that when you saved the model, and when you try to load it, the original CUDA devices (or some of them, or the main CUDA device) are not available?

I solved it. When resuming a model, the optimizer should be created after loading the model, calling model.cuda(), and wrapping it with DataParallel(model).
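In case it helps someone else, here is a minimal sketch of the order that worked for me. The tiny `nn.Linear` model, the SGD settings, and the in-memory `checkpoint` dict are made-up stand-ins for my real setup, and I fall back to the CPU so the snippet also runs without a GPU:

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for a checkpoint on disk: one training step on a toy model.
src = nn.Linear(4, 2)
src_opt = torch.optim.SGD(src.parameters(), lr=0.1, momentum=0.9)
src(torch.randn(3, 4)).sum().backward()
src_opt.step()
checkpoint = {"model": src.state_dict(), "optimizer": src_opt.state_dict()}

# Resume: 1) load the weights, 2) move to the device, 3) wrap in DataParallel.
model = nn.Linear(4, 2)
model.load_state_dict(checkpoint["model"])
model.to(device)  # same role as model.cuda() when a GPU is present
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

# 4) Only now create the optimizer, so it references the parameters on the device.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
optimizer.load_state_dict(checkpoint["optimizer"])
```

With this order the optimizer holds the same parameter tensors the model actually trains, and its restored state ends up on the same device as those parameters.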