I had my code set up for a single GPU, and I then used DataParallel to train on multiple GPUs; I am on a 4-GPU machine.
But now during training my losses are very large, and my perplexity is not coming down from an extremely large value.
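For context, here is roughly the only change I made (simplified; `MyModel`, `train_loader`, the loss, and the hyperparameters are placeholders standing in for my actual code):

```python
import torch
import torch.nn as nn

device = torch.device("cuda")

model = MyModel()               # placeholder for my actual model
model = nn.DataParallel(model)  # the only multi-GPU change; uses all 4 GPUs by default
model.to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # same lr as my single-GPU run
criterion = nn.CrossEntropyLoss()

for inputs, targets in train_loader:    # same batch size as my single-GPU run
    inputs, targets = inputs.to(device), targets.to(device)
    optimizer.zero_grad()
    outputs = model(inputs)             # DataParallel scatters the batch across the GPUs
    loss = criterion(outputs.view(-1, outputs.size(-1)), targets.view(-1))
    loss.backward()
    optimizer.step()
    perplexity = torch.exp(loss)        # this stays extremely large
```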
I want to ask:
- What changes do I need to make to the batch size?
  I see only the zeroth GPU using a lot of memory, while the rest of the GPUs have underutilized memory.
- What changes do I need to make to the learning rate?
  I read somewhere that I should increase my learning rate, but why? (My guess at what that advice means is sketched after this list.)
- What is happening here during backprop? Are the gradients accumulated on the zeroth GPU?
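For reference, this is what I understood the learning-rate advice to mean, i.e. the linear scaling rule; the numbers below are made up for illustration and are not from my actual config. Please correct me if I have this wrong:

```python
# Linear scaling rule, as I understand it:
# if the effective batch size grows by a factor of k, scale the learning rate by k.
base_lr = 1e-3     # lr from my single-GPU run (illustrative value)
base_batch = 32    # global batch size on one GPU (illustrative value)
num_gpus = 4

new_batch = base_batch * num_gpus            # if I grow the global batch with the GPU count
new_lr = base_lr * (new_batch / base_batch)  # = base_lr * num_gpus = 4e-3
```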
Please let me know soon; my AWS bills are rising!