Is distributed model performance proportional to non-distributed model performance?


When using torch.nn.parallel.DistributedDataParallel for multi-process training, should the performance (e.g. loss) be proportionally better than in a single-process setting, where the proportion depends on the number of processes? My intuition is that since the dataset in a multi-process setting is subdivided among the individual processes, the model will have an easier time fitting these partitioned datasets, and hence show better training metrics than when the entire dataset is fitted in a single-process setting. Can someone confirm whether my intuition is correct?

The distributed model requires communication to synchronize gradients across processes, so it does not scale linearly compared to local training. Also, each process does not solve an easier subproblem: because DistributedDataParallel averages gradients across all workers at every step, each update reflects the statistics of the whole batch across processes, much like large-batch single-process training. So the loss is not proportionally better; the main benefit is wall-clock speedup, reduced by communication overhead.
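A minimal pure-Python sketch of why gradient syncing makes data-parallel training equivalent to full-batch training rather than a set of easier subproblems. The toy 1-D regression data and the two-shard split are illustrative assumptions, not from the original post; the shard-gradient averaging mimics the allreduce that DistributedDataParallel performs.

```python
# Toy 1-D linear regression: loss(w) = mean((w*x - y)^2) over (x, y) pairs.
# We compare the full-batch gradient with the average of per-shard
# gradients, mimicking DDP's gradient allreduce across processes.

def grad(w, pairs):
    # d/dw mean((w*x - y)^2) = mean(2*(w*x - y)*x)
    return sum(2 * (w * x - y) * x for x, y in pairs) / len(pairs)

# Hypothetical dataset, split into two equal "process" shards.
data = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1), (4.0, 8.2)]
w = 0.5

full_grad = grad(w, data)

shards = [data[:2], data[2:]]
synced_grad = sum(grad(w, s) for s in shards) / len(shards)

# The averaged (synced) gradient matches the single-process full-batch
# gradient, so each worker effectively optimizes the same objective.
print(full_grad, synced_grad)
```

Since every worker applies this same averaged gradient, the optimization trajectory tracks single-process large-batch training; the partitioned data does not make the loss proportionally lower.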