Multi-Machine and Multi-GPU training

Hi there, I’m new to distributed training and confused about training neural networks on multiple machines and GPUs. Suppose I have 2 machines: the 1st is equipped with 2 TITAN X cards, while the 2nd has 4 1080 Ti cards. How do I initialize the torch.distributed package, and how do I actually run the training? Can anyone explain the whole pipeline?

model = torch.nn.parallel.DistributedDataParallel(model, device_ids=???)

Any help would be appreciated!

The initialization section of the torch.distributed documentation gives you more information:

Basically, if you use TCP initialization, you take the IP address of the first host machine (and some free port) and pass them as the address:port in the init method, for example: init_method='tcp://<ip_of_machine_1>:<port>'

This is the simplest initialization method.

If you have access to a shared file system, you can use the shared file-system initialization method instead.
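The file-system variant only changes the init method string. The path below is a hypothetical NFS mount that both machines can see; note that the rendezvous file must not exist before the first process starts:

```python
import torch.distributed as dist

def setup_file(rank, world_size, shared_path):
    """Rendezvous through a file on a filesystem all machines can see."""
    dist.init_process_group(
        backend="gloo",                       # or "nccl" on GPU
        init_method=f"file://{shared_path}",  # e.g. an NFS-mounted path
        rank=rank,
        world_size=world_size,
    )

# Each of the 6 processes would call, with a hypothetical shared path:
#   setup_file(rank, 6, "/mnt/nfs/ddp_init")
```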