I am facing an issue when back-propagating the loss because of the large effective batch size accumulated across the 8 GPUs in DataParallel. Is there a way to divide the loss calculation into 8 parts, normalize the loss, and then call loss.backward() on the smaller pieces so each backward pass is smaller? Or how else can I work around this?
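One common workaround for this is gradient accumulation: split the large batch into micro-batches, scale each micro-batch loss by the number of accumulation steps, and call backward() per micro-batch so gradients add up to the full-batch gradient. A minimal sketch (the model, data shapes, and `accum_steps` here are hypothetical placeholders):

```python
import torch
import torch.nn as nn

# Hypothetical model and data purely for illustration.
model = nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

batch = torch.randn(32, 16)           # one large batch
targets = torch.randint(0, 4, (32,))
accum_steps = 8                       # split into 8 micro-batches

optimizer.zero_grad()
for x, y in zip(batch.chunk(accum_steps), targets.chunk(accum_steps)):
    # Divide by accum_steps so the summed gradients match the
    # gradient of the mean loss over the full batch.
    loss = loss_fn(model(x), y) / accum_steps
    loss.backward()                   # gradients accumulate in p.grad
optimizer.step()                      # one optimizer step for the whole batch
```

This trades memory for extra forward/backward passes; the optimizer still takes a single step per large batch.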
I assume you are using DataParallel. I would suggest switching to DDP: each worker computes the loss separately on its own shard of the batch, and the gradients are synchronized (averaged) across workers during the backward pass, so no single process has to handle the full batch.
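A minimal DDP sketch might look like this (the model, batch shapes, and addresses are placeholders; a real multi-GPU job would be launched with torchrun, use the nccl backend, and pass `device_ids`):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank: int, world_size: int) -> float:
    # In a real job torchrun sets these; hard-coded here for a local demo.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = torch.nn.Linear(16, 4)   # hypothetical model
    ddp_model = DDP(model)           # gradients are all-reduced inside backward()
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()

    x = torch.randn(8, 16)           # each rank sees only its shard of the batch
    y = torch.randint(0, 4, (8,))
    optimizer.zero_grad()
    loss = loss_fn(ddp_model(x), y)  # loss computed per rank
    loss.backward()                  # DDP averages gradients across ranks here
    optimizer.step()

    dist.destroy_process_group()
    return loss.item()

final_loss = run(rank=0, world_size=1)  # single-process CPU demo
```

Because each rank only ever sees its own shard, the per-process memory footprint stays small even as the global batch grows.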
@cbalioglu Can you recommend some resources to learn how to efficiently apply DDP to my large model across multiple GPUs?