DDP on 8 GPUs vs. Single GPU training speed

headkit (H.H.) May 25, 2021, 10:33am 3

Maybe you can find a solution together:

DistributedDataParallel training not efficient distributed

Very interesting project! So basically, training with 4 GPUs needs 4 epochs to get the same results as a single GPU achieves in only 1 epoch. This is not true if you consider the sync among the 4 GPUs per epoch. It should be equivalent to running 4 epochs on a single GPU. Can you confirm whether there is any communication between the different processes (by printing the gradient values of different ranks after backward)? Gradients of different ranks should be the same after backward. Additionally, you…
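
To make the gradient check in the quoted post concrete, here is a minimal sketch of what "gradients of different ranks should be the same after backward" means. It is a pure-Python simulation, not PyTorch code: the 4 "ranks", the toy model y = w * x, and the helpers `local_grad` and `all_reduce_mean` are all assumptions made up for illustration. In a real DDP run you would instead print `param.grad` on each rank after `backward()` and compare the values.

```python
# Pure-Python simulation (no PyTorch): 4 hypothetical "ranks", one data shard each.
def local_grad(w, shard):
    """Gradient of the mean squared error for y = w * x over one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for gradient averaging across ranks: every rank gets the mean."""
    mean = sum(values) / len(values)
    return [mean for _ in values]

data = [(float(x), 3.0 * x) for x in range(1, 9)]  # 8 samples, true w = 3
shards = [data[r::4] for r in range(4)]            # equal-sized shard per rank
w = 0.5                                            # shared initial weight

local = [local_grad(w, s) for s in shards]         # per-rank gradients before sync
synced = all_reduce_mean(local)                    # per-rank gradients after sync
full = local_grad(w, data)                         # single-process, full batch

for rank, (before, after) in enumerate(zip(local, synced)):
    print(f"rank {rank}: local={before:.3f} synced={after:.3f}")
print(f"full-batch gradient: {full:.3f}")
```

If the synced values differ between ranks in a real run, the processes are probably not communicating, which is the situation the quoted post asks to rule out. With equal-sized shards, the averaged gradient also matches the gradient of one process over the combined batch.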

