After Update pytorch from 1.1.0 to 1.7.0. DDP Model cannot converge

sunshichen · December 14, 2020, 6:18am

I trained a 3D unet model by Pytorch 1.1.0 Distributed Data Parallel. The test result is good. But after I update Pytorch to V1.7.0. Same code different result. Anyone can help me with that? Is it a syncbatchnorm problem or what?

Even if I finetune the pretrained model(on 1.1.0). Both losses during training and validation increased a lot using v1.7.0.

Below is the training epoch loss of two version. loss

ptrblck · December 14, 2020, 8:09am

Are you only seeing the increase in the losses while using DDP or also while training the model on a single GPU?

sunshichen · December 14, 2020, 8:37am

I only tested on DDP. Because my input data is too large to put in single GPU. So I have to use DDP with syncbatchnorm.
P.S. Could it be a cuda version problem? I used Pytorch1.7+cu101, however my server still using cuda10.

ptrblck · December 14, 2020, 9:34am

It’s unclear where the difference is coming from and I would recommend to check the loss curves on a single GPU first (by reducing the batch size or slimming down the model).
The next steps depend on the outcome of this run (e.g. using DDP again and removing SyncBN etc.).

sunshichen · December 14, 2020, 9:45am

The input data is too large that it can only fit into GPU with batchsize=1. This is why I have to choose syncbatchnorm with DDP to simulate larger batchsize. By the way, groupnorm took much more GPU memory(Why I didn’t use it.). Do you have any idea where should I check first except for try single GPU?

ptrblck · December 14, 2020, 9:55am

Try to eliminate as many moving parts as possible in order to isolate it.

sunshichen · December 14, 2020, 9:57am

Can you give me more details? Like an example?

P.S.
Even though I trained on 4 GPU(simulate bs=4) still converge much slower than before(2 GPU, simulate bs=2)

loss2

Hzzone · December 25, 2020, 5:11am

I have also met this problem when I tried to reproduce GLOW using this code, GitHub - rosinality/glow-pytorch: PyTorch implementation of Glow with torch 1.7+cu101. It works ok with DP.