Hey guys, I've run into a tricky problem.
During training, at the end of each epoch I first save the model (and the optimizer state, of course) and then evaluate it. But when I try to resume from this checkpoint, I run into trouble:
When I continue training, the loss changes compared with the previous run (it increases a lot). If I train on only 1 GPU, the loss looks unchanged; but the more GPUs I use, the more the loss increases over the previous results (I only tried 1/2/8 GPUs).
When I evaluate the reloaded model again, the accuracy drops significantly (~27 before vs. ~3 after, a huge drop!), no matter how many GPUs I use.
My project is too large to post here, but the same framework works fine with ResNet18 for image classification.
Does anyone have any clues about what could cause these problems? Although I can't provide my code, I'll check or try whatever you suggest.
Thanks a lot. This has been keeping me up all night.
I've tried a lot of things, and I found that the model parameters on GPU 0 and GPU 1 are inconsistent. That really confuses me.
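In case anyone wants to reproduce this kind of check: one generic way to detect replica drift is to broadcast rank 0's parameters and compare them against each rank's local copy. This is only a sketch (the Linear model, address, and port are placeholders; it initializes a single-process gloo group so it runs anywhere on CPU, but a real check needs one process per GPU, e.g. launched via torchrun):

```python
import os
import torch
import torch.distributed as dist

# Placeholder settings for a single-process run; in a real multi-GPU job
# the launcher (e.g. torchrun) sets these environment variables.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(4, 2)  # stand-in for the real model

# Compare each local parameter with rank 0's copy; any difference means
# the replicas have drifted apart.
mismatched = []
for name, param in model.named_parameters():
    reference = param.detach().clone()
    dist.broadcast(reference, src=0)  # every rank receives rank 0's values
    if not torch.equal(param.detach(), reference):
        mismatched.append(name)

print("mismatched parameters:", mismatched)
dist.destroy_process_group()
```

With a single process the list is always empty; run one process per GPU to make the comparison meaningful.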
It will be easier to get an answer if you provide some code snippets showing how you save and load the models, how you convert from single-GPU to multi-GPU training, etc. Without a reference, it's difficult to hypothesize about what might be wrong.
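For reference, a typical save/resume pattern looks something like the sketch below (the model, optimizer, epoch number, and file path are all placeholders, not your code; with a DDP-wrapped model you would save model.module.state_dict() so the keys have no "module." prefix, and restore state before re-wrapping):

```python
import os
import tempfile
import torch

model = torch.nn.Linear(4, 2)  # stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
path = os.path.join(tempfile.gettempdir(), "checkpoint.pt")

# Save model and optimizer state together at the end of an epoch.
checkpoint = {
    "model": model.state_dict(),        # for DDP: model.module.state_dict()
    "optimizer": optimizer.state_dict(),
    "epoch": 3,                         # placeholder epoch counter
}
torch.save(checkpoint, path)

# Resume: rebuild model/optimizer first, then restore both states
# before wrapping the model in DDP again.
model2 = torch.nn.Linear(4, 2)
optimizer2 = torch.optim.SGD(model2.parameters(), lr=0.1)
state = torch.load(path)
model2.load_state_dict(state["model"])
optimizer2.load_state_dict(state["optimizer"])
start_epoch = state["epoch"] + 1
```

If your code already follows this shape, the problem is more likely in the multi-GPU setup than in the checkpointing itself.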
Thanks for your reply!
I'm sorry for not providing code snippets in this discussion (at the time, I was confident that my framework's save and load methods were correct, since I had been using the framework for a long time).
I have now identified the cause of the problem, and I'll explain everything in my next reply.
My mistake was that I didn't call forward() on the DDP-wrapped model. My model's design is complicated, so combining the forward passes of all the modules into a single forward() function is difficult.
If you use DDP but never go through the DDP model's forward() function, the parameters will not stay in sync across GPUs: DDP sets up its gradient-averaging hooks when its forward() runs, so bypassing it means each replica trains on its own local gradients and the copies drift apart. This causes serious problems.
More explanation in another discussion.
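To illustrate the pitfall, here is a minimal sketch (a Linear stand-in model and a single-process gloo group so it runs on CPU; address, port, and shapes are placeholders):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process setup so the sketch runs anywhere; a real job would use
# torchrun with one process per GPU (address/port are placeholders).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(4, 2))  # stand-in for the real model
x = torch.randn(3, 4)

# Wrong: calling the wrapped module (or submodules) directly bypasses
# DDP's forward(), so its gradient-averaging hooks never fire and the
# replicas drift apart:
#   out = model.module(x)
# Right: always go through the DDP wrapper itself:
out = model(x)
out.sum().backward()  # gradients are all-reduced across ranks here

dist.destroy_process_group()
```

If the architecture makes a single forward() awkward, one option is a thin wrapper module whose forward() calls the submodules in order, so the DDP hooks still see every parameter.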