I had my code set up for a single GPU, and I then used DataParallel to train on multiple GPUs; I am on a 4-GPU machine.
But now during training my losses are very large, and my perplexity is not coming down from an extremely large value.
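For context, here is roughly the only change I made (simplified; `MyModel`, `train_loader`, the loss, and the hyperparameters are placeholders standing in for my actual code):

```python
import torch
import torch.nn as nn

device = torch.device("cuda")

model = MyModel()               # placeholder for my actual model
model = nn.DataParallel(model)  # the only multi-GPU change; uses all 4 GPUs by default
model.to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # same lr as my single-GPU run
criterion = nn.CrossEntropyLoss()

for inputs, targets in train_loader:    # same batch size as my single-GPU run
    inputs, targets = inputs.to(device), targets.to(device)
    optimizer.zero_grad()
    outputs = model(inputs)             # DataParallel scatters the batch across the GPUs
    loss = criterion(outputs.view(-1, outputs.size(-1)), targets.view(-1))
    loss.backward()
    optimizer.step()
    perplexity = torch.exp(loss)        # this stays extremely large
```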
I want to ask:
- What changes do I need to make to the batch size?
  I see only the zeroth GPU using a lot of memory, while the rest of the GPUs have underutilized memory.
- What changes do I need to make to the learning rate?
  I read somewhere that I should increase my learning rate, but why? (My guess at what that advice means is sketched after this list.)
- What is happening here during backprop? Are the gradients accumulated on the zeroth GPU?
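For reference, this is what I understood the learning-rate advice to mean, i.e. the linear scaling rule; the numbers below are made up for illustration and are not from my actual config. Please correct me if I have this wrong:

```python
# Linear scaling rule, as I understand it:
# if the effective batch size grows by a factor of k, scale the learning rate by k.
base_lr = 1e-3     # lr from my single-GPU run (illustrative value)
base_batch = 32    # global batch size on one GPU (illustrative value)
num_gpus = 4

new_batch = base_batch * num_gpus            # if I grow the global batch with the GPU count
new_lr = base_lr * (new_batch / base_batch)  # = base_lr * num_gpus = 4e-3
```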
Please let me know soon; my AWS bills are rising!