I’m training a language model using the code here: https://github.com/pytorch/examples/blob/master/word_language_model/main.py
I have made some slight changes so that the model can be trained across multiple GPUs. However, the GPU memory usage is extremely imbalanced.
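For reference, here is roughly the kind of change I made (just a sketch with a made-up stand-in model and layer sizes, not my exact diff against main.py):

```python
import torch
import torch.nn as nn

# Stand-in for the RNNModel that main.py builds (the sizes here are made up).
model = nn.LSTM(input_size=200, hidden_size=200, num_layers=2, batch_first=True)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = model.to(device)

if torch.cuda.device_count() > 1:
    # This is essentially my change: replicate the model so each forward
    # pass scatters the batch (along dim 0) across all visible GPUs.
    model = nn.DataParallel(model)

# Training then proceeds as in main.py. The per-replica outputs are
# gathered back onto cuda:0, and that is where the memory piles up.
```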
I understand that one GPU is used to gather and store all the outputs. Is there any way to balance the memory usage? Or can I dedicate one GPU to gathering the outputs and use the rest for training on the batches?
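To make the second question more concrete, this is the kind of thing I'm imagining (purely hypothetical, reusing `model` from the sketch above; I'm not sure DataParallel actually supports this or whether it would help):

```python
import torch.nn as nn

# Hypothetical: run the replicas on GPUs 1-3 only and gather the outputs
# on GPU 0, so GPU 0's memory is reserved for the gathered outputs.
# I don't know whether output_device can sit outside device_ids like this.
model = nn.DataParallel(model, device_ids=[1, 2, 3], output_device=0)
```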
Thanks,
Yuzhou