Hi there, I'm new to distributed training and confused about training neural networks across multiple machines and GPUs. Suppose I have 2 machines: the 1st machine has 2 TITAN X cards, while the 2nd has 4 1080 Ti cards. How do I initialize the torch.distributed package, and how do I actually train with it? Can anyone explain the whole pipeline?
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=???)
Any help would be appreciated!
The initialization section gives you more information: http://pytorch.org/docs/master/distributed.html#initialization
Basically, if you use TCP initialization, you take the IP address of the first machine (the one hosting rank 0), pick a free port on it, and pass the same address:port to every process.
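For example, a minimal sketch; 192.168.1.1:23456 is a placeholder for machine 1's actual IP and a free port, and the RANK environment variable is assumed to be set per process by whatever launches your jobs:

```python
import os
import torch.distributed as dist

# Launch one process per GPU, so world_size = 2 + 4 = 6 in your setup.
rank = int(os.environ["RANK"])  # unique id in [0, 6), set by your launcher

dist.init_process_group(
    backend="nccl",                         # NCCL is the usual backend for GPU training
    init_method="tcp://192.168.1.1:23456",  # machine 1's IP and a free port (placeholders)
    rank=rank,
    world_size=6,
)
```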
This is the simplest initialization method.
If you have access to a shared file system, you can use the shared file-system initialization method instead.
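In that case, a minimal sketch of the file-based variant, plus wrapping the model afterwards, might look like the following. The path /mnt/nfs/sharedfile is a placeholder, and RANK/LOCAL_RANK are assumed to be set by your launcher. With one process per GPU, device_ids is just that process's local GPU index, which also answers the device_ids=??? question above:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

rank = int(os.environ["RANK"])              # global rank in [0, 6), set by your launcher
local_rank = int(os.environ["LOCAL_RANK"])  # GPU index on this particular machine

# The file must sit on a file system that both machines can see (placeholder path).
dist.init_process_group(
    backend="nccl",
    init_method="file:///mnt/nfs/sharedfile",
    rank=rank,
    world_size=6,  # 2 TITAN X + 4 1080 Ti = 6 processes
)

torch.cuda.set_device(local_rank)
model = torch.nn.Linear(10, 10).cuda(local_rank)  # stand-in for your real model
model = DistributedDataParallel(model, device_ids=[local_rank])
```

After this, training looks like ordinary single-GPU training; gradients are averaged across all 6 processes automatically on each backward pass.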