DataParallel caching replicate()

Is there a way to perform distributed data parallelism within a single node, across multiple GPUs? DataParallel re-copies the model on every forward pass, and that seems to slow things down significantly for me.
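
For context, here is a rough sketch of what DataParallel effectively does on every forward call, written with the public helpers in torch.nn.parallel (this is a paraphrase of the documented behaviour, not the library source; the helper name data_parallel_forward and the device IDs are just for illustration). The replicate() step is the per-iteration model copy in question:

```python
import torch
import torch.nn as nn
from torch.nn.parallel import scatter, replicate, parallel_apply, gather

def data_parallel_forward(module, batch, device_ids, output_device=0):
    # Split the input batch across the configured GPUs.
    scattered = scatter(batch, device_ids)
    # Copy the module's parameters and buffers to every device.
    # This happens again on the next call, which is the overhead in question.
    replicas = replicate(module, device_ids[:len(scattered)])
    # Run each replica on its shard of the batch in parallel.
    outputs = parallel_apply(replicas, [(x,) for x in scattered])
    # Move the per-GPU outputs back to one device and concatenate.
    return gather(outputs, output_device)

if __name__ == "__main__":
    if torch.cuda.device_count() >= 2:
        model = nn.Linear(16, 4).cuda(0)
        x = torch.randn(8, 16).cuda(0)
        out = data_parallel_forward(model, x, device_ids=[0, 1])
        print(out.shape)  # torch.Size([8, 4])
```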

Doesn’t distributed data parallelism also synchronize?

Yes, right. What I'm after is a synchronized version of DataParallel that doesn't re-replicate the model on every forward pass. How can that be done?
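
For reference, DistributedDataParallel also works within a single node, with one process per GPU: each process builds the model once, and only gradients are all-reduced during backward(), so there is no per-iteration replicate(). Below is a minimal single-node sketch, assuming the NCCL backend, a toy model, and random tensors standing in for a real DataLoader with a DistributedSampler; adapt it to your own training loop.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    # Single-node rendezvous: all processes connect to localhost.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # The model is constructed once per process, not once per iteration.
    model = nn.Linear(16, 4).cuda(rank)
    ddp_model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for step in range(10):
        # Each process would normally load its own shard of the batch
        # (e.g. via a DataLoader with a DistributedSampler).
        x = torch.randn(8, 16).cuda(rank)
        y = torch.randn(8, 4).cuda(rank)

        optimizer.zero_grad()
        loss = loss_fn(ddp_model(x), y)
        loss.backward()   # gradients are all-reduced across GPUs here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)
```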