I think in DataParallel mode there is only one "master" copy of the model: on each forward pass, DataParallel splits the input batch across the GPUs, replicates the model onto them, and gathers the outputs back on the master device, where the loss is computed. In DistributedDataParallel, by contrast, each process (one per GPU or compute node) keeps its own independent model replica, and gradients are averaged across processes via all-reduce during the backward pass. For a concrete discussion, you can refer to this thread: Is average the correct way for the gradient in DistributedDataParallel with multi nodes?
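To make the contrast concrete, here is a minimal sketch (not a full training script) showing both APIs side by side. The DDP half assumes a single-node launch via `torchrun`, which populates the `RANK`/`LOCAL_RANK` environment variables that `init_process_group` reads:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

model = nn.Linear(10, 1)

# --- DataParallel: single process, one master replica ---
# forward() scatters the batch across all visible GPUs, replicates the
# model, and gathers outputs back on the default device (cuda:0), where
# the loss is computed.
if torch.cuda.device_count() > 1:
    dp_model = nn.DataParallel(model.cuda())
    out = dp_model(torch.randn(32, 10).cuda())  # out lives on cuda:0
    out.sum().backward()                        # loss computed on master

# --- DistributedDataParallel: one process per GPU, independent replicas ---
# Launch with e.g. `torchrun --nproc_per_node=2 this_script.py`. Each
# process holds its own model copy; gradients are averaged across
# processes by all-reduce during backward().
if "RANK" in os.environ:
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    ddp_model = DDP(model.cuda(local_rank), device_ids=[local_rank])
    out = ddp_model(torch.randn(32, 10).cuda(local_rank))
    out.sum().backward()          # gradients all-reduced (averaged) here
    dist.destroy_process_group()
```

Note the asymmetry: in DataParallel the synchronization point is the output gather on the master device, while in DDP it is the gradient all-reduce, so no single node ever needs to see the full batch.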