Multi-Machine and Multi-GPU training

Hi there, I’m new to distributed training and I’m confused about training neural networks on multiple machines and GPUs. Suppose I have 2 machines: the 1st is equipped with 2 TITAN X cards, while the 2nd has 4 1080 Ti cards. How do I initialize the torch.distributed package, and how do I actually train the model? Can anyone explain the whole pipeline for this?

torch.distributed.init_process_group("gloo", init_method=???, world_size=???)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=???)

Any help would be appreciated!

The initialization section of the docs gives you more information: http://pytorch.org/docs/master/distributed.html#initialization

Basically, with TCP initialization you take the IP address of the first host machine (plus some free port on it) and use that as the address:port pair, for example: init_method='tcp://10.1.1.20:23456'

This is the simplest and easiest initialization method.
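Here is a minimal sketch of what that could look like for your setup, assuming one process per GPU (so world_size=6: ranks 0–1 on the first machine, 2–5 on the second). The 10.1.1.20:23456 address is the example above; the rank/GPU arguments, the toy nn.Linear model, and the launch details are placeholders you would adapt:

```python
# Launch this script once per GPU on each machine, passing a unique --rank.
import argparse

import torch
import torch.distributed as dist
import torch.nn as nn

parser = argparse.ArgumentParser()
parser.add_argument("--rank", type=int, required=True)       # global rank, 0..5
parser.add_argument("--local-gpu", type=int, required=True)  # GPU index on this machine
args = parser.parse_args()

# All processes rendezvous via the first machine's IP and a free port.
dist.init_process_group(
    backend="gloo",
    init_method="tcp://10.1.1.20:23456",
    world_size=6,          # 2 GPUs on machine 1 + 4 GPUs on machine 2
    rank=args.rank,
)

torch.cuda.set_device(args.local_gpu)

# Stand-in for your real model; wrap it so gradients are averaged across processes.
model = nn.Linear(10, 10).cuda(args.local_gpu)
model = nn.parallel.DistributedDataParallel(model, device_ids=[args.local_gpu])
```

After the wrap, you train as usual in each process; DistributedDataParallel handles the gradient synchronization during backward.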

If all machines have access to a shared file system, you can use the shared file-system initialization method instead.
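A minimal sketch of that variant, assuming a hypothetical NFS path /mnt/nfs/sharedfile that every machine can read and write, and the rank supplied via a RANK environment variable when launching each process:

```python
import os

import torch.distributed as dist

rank = int(os.environ["RANK"])  # set RANK=0..5 when launching each process

# Every process points at the same shared file; it is used only for rendezvous.
dist.init_process_group(
    backend="gloo",
    init_method="file:///mnt/nfs/sharedfile",  # must be visible from all machines
    world_size=6,
    rank=rank,
)
```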