How to convert a single-GPU PyTorch script to a multi-GPU multi-node PyTorch script with DDP?

I have read this 10 times already, but honestly it's not really helpful; what I need is a place that lists the modifications needed to convert single-GPU code to multi-node, multi-GPU code.

Is there a place in the doc that explains how to distribute a PyTorch training script over multiple machines?


Check out these resources. They helped me understand how to do it. I agree that the docs as of now are not to the point.
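To summarize what those resources cover, the usual changes are: initialize a process group, shard the data with `DistributedSampler`, move the model to the local device and wrap it in `DistributedDataParallel`, and launch one process per GPU with `torchrun`. Here is a minimal sketch; the toy model, dataset, and hyperparameters are placeholders, and the single-process env-var defaults are only there so the sketch also runs standalone (under `torchrun` they are overridden):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # Defaults so the sketch runs as a single process without torchrun;
    # torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, MASTER_PORT.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")

    # 1. Initialize the process group (NCCL for GPUs, Gloo for CPU).
    dist.init_process_group(
        backend="nccl" if torch.cuda.is_available() else "gloo"
    )
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    device = torch.device(
        f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu"
    )

    # 2. Shard the dataset across processes with DistributedSampler
    #    (a random toy dataset stands in for your real one).
    dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    # 3. Move the model to this process's device, then wrap it in DDP.
    model = torch.nn.Linear(10, 1).to(device)
    model = DDP(
        model,
        device_ids=[local_rank] if torch.cuda.is_available() else None,
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    loss = None
    for epoch in range(2):
        # 4. Reshuffle the shards each epoch.
        sampler.set_epoch(epoch)
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            loss = torch.nn.functional.mse_loss(model(x), y)
            optimizer.zero_grad()
            loss.backward()  # DDP all-reduces gradients here.
            optimizer.step()

    dist.destroy_process_group()
    return float(loss.item())

if __name__ == "__main__":
    main()
```

To run across two machines with eight GPUs each, you would launch the same script on both nodes with something like `torchrun --nnodes=2 --nproc_per_node=8 --rdzv_backend=c10d --rdzv_endpoint=MASTER_HOST:PORT train.py` (host, port, and counts are placeholders for your cluster).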

Hey @Olivier-CR, do you mind opening an issue in the PyTorch repository for this? There is definitely room for improvement in our documentation and this would help us to prioritize it in the near future. Thanks!