Is there a way to perform distributed data parallelism within a single node across multiple GPUs? DataParallel replicates the model on every forward pass, and that seems to slow things down significantly for me.
Doesn’t distributed data parallelism also synchronize?
Yes, right. What I want is a synchronized version of DataParallel that doesn't replicate the model on each forward pass. How can that be done?
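A minimal single-node sketch using `torch.nn.parallel.DistributedDataParallel` (DDP): one process per device, the model is wrapped once at startup instead of being copied every forward pass, and gradients are synchronized via all-reduce during `backward()`. The toy `nn.Linear` model, port number, and the `"gloo"` backend (chosen so this runs on CPU) are illustrative assumptions; on a real multi-GPU node you would typically use the `"nccl"` backend and pin each process to one GPU.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # One process per device. "gloo" lets this sketch run on CPU;
    # on a multi-GPU node you'd normally use "nccl" and
    # device_ids=[rank] when wrapping with DDP.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"  # arbitrary free port (assumption)
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = nn.Linear(10, 1)  # toy model; wrapped once, not copied per step
    ddp_model = DDP(model)

    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    for _ in range(2):
        opt.zero_grad()
        loss = ddp_model(torch.randn(8, 10)).sum()
        loss.backward()  # gradients are all-reduced here across processes
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2  # e.g. two GPUs on one node
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)
```

Because each process keeps its own long-lived replica, the per-iteration cost is only the gradient all-reduce, not a fresh model broadcast, which is why DDP is generally recommended over DataParallel even on a single node.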