Major issue when doing multi-GPU training (loss gets worse)


I had my code set up for a single GPU. I then used `DataParallel` to train on multiple GPUs; I am on a 4-GPU machine.
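For reference, my setup is roughly like this (the model here is just a placeholder, my real code is not shown):

```python
import torch
import torch.nn as nn

# Placeholder model standing in for my actual network.
model = nn.Linear(512, 512)

# Wrap for multi-GPU training; DataParallel splits each batch
# across the visible GPUs and gathers the outputs on GPU 0.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# One forward pass: the batch dimension is what gets sharded.
x = torch.randn(32, 512, device=device)
out = model(x)
print(out.shape)  # torch.Size([32, 512])
```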

But now my training losses are very large, and my perplexity is stuck at an extremely large value and just not going down.

I want to ask:

  1. What changes do I need to make to the batch size?

I see only the zeroth GPU using a lot of memory, while the rest of the GPUs are under-utilized.

  2. What changes do I need to make to the learning rate?

I read somewhere that I should increase my learning rate, but why?

  3. What is happening during backprop? Are the gradients accumulated on the zeroth GPU?

Please let me know :slight_smile: my AWS bills are rising :frowning: