I designed a model for object detection and trained it on one GPU; the loss decreases and I got good results.
But when I try to train the same network on multiple GPUs, the loss does not decrease. Even stranger, the loss curve is exactly the same no matter whether the lr is 0.05 or 0.0005.
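To illustrate why the identical curve across learning rates is a strong clue (this is a framework-free sketch of plain SGD, not my actual training code): if gradients reaching the optimizer are effectively zero, e.g. because the graph between the multi-GPU replicas and the loss got cut, then the parameter "update" is the same for every lr, which would produce identical curves.

```python
def sgd_step(params, grads, lr):
    # One plain SGD update: p <- p - lr * g
    return [p - lr * g for p, g in zip(params, grads)]

params = [0.5, -1.2]

# Healthy case: nonzero gradients, so the lr changes the update.
grads = [0.3, -0.1]
a = sgd_step(params, grads, 0.05)
b = sgd_step(params, grads, 0.0005)
print(a != b)  # different lr -> different parameters

# Broken case: zero gradients (e.g. the backward graph was cut),
# so the step is identical no matter what lr is used.
zero_grads = [0.0, 0.0]
c = sgd_step(params, zero_grads, 0.05)
d = sgd_step(params, zero_grads, 0.0005)
print(c == d)  # same "update" for any lr -> identical loss curves
```

A quick sanity check along these lines in the real training loop (snapshot a parameter, take one step at two different lrs, compare the deltas) can confirm whether updates are actually being applied.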
The same dataset and training pipeline are used for other models and work well on multiple GPUs.
Does anybody have an idea what the potential reason could be?
If you need more details, just ask in the comments.