Multi-Machine and Multi-GPU training

Hi there, I’m new to distributed training and confused about training neural networks on multiple machines and GPUs. Suppose I have 2 machines: the 1st is equipped with 2 TITAN X cards, while the 2nd has 4 1080 Ti cards. How do I initialize the torch.distributed package, and how do I actually run the training? Can anyone explain the whole pipeline?

model = torch.nn.parallel.DistributedDataParallel(model, device_ids=???)

Any help would be appreciated!

The initialization section of the torch.distributed documentation gives you more information:

Basically, if you use TCP initialization, you take the IP address of the first host machine (and some free port) and pass them as the address:port in the init method, for example: init_method='tcp://<ip_of_machine_1>:<port>'

This is the simplest initialization method.

If you have access to a shared file system, you can use the shared file-system initialization method instead.
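The file-system variant only changes the init method string. The path below is a hypothetical NFS mount that both machines can see; note that the rendezvous file must not exist before the first process starts:

```python
import torch.distributed as dist

def setup_file(rank, world_size, shared_path):
    """Rendezvous through a file on a filesystem all machines can see."""
    dist.init_process_group(
        backend="gloo",                       # or "nccl" on GPU
        init_method=f"file://{shared_path}",  # e.g. an NFS-mounted path
        rank=rank,
        world_size=world_size,
    )

# Each of the 6 processes would call, with a hypothetical shared path:
#   setup_file(rank, 6, "/mnt/nfs/ddp_init")
```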