I am going to train my model on multiple servers (N servers), each of which has 8 GPUs, i.e. 8*N GPUs in total.
I have checked the code provided by a tutorial that uses distributed training to train a model on ImageNet ( https://github.com/pytorch/examples/tree/master/imagenet ).
I found that I need to run the training code on each server separately, just as the guide describes:
Multiple nodes:
Node 0:
python main.py -a resnet50 --dist-url 'tcp://IP_OF_NODE0:FREEPORT' --dist-backend 'nccl' --multiprocessing-distributed --world-size 2 --rank 0 [imagenet-folder with train and val folders]
Node 1:
python main.py -a resnet50 --dist-url 'tcp://IP_OF_NODE0:FREEPORT' --dist-backend 'nccl' --multiprocessing-distributed --world-size 2 --rank 1 [imagenet-folder with train and val folders]
I am wondering whether it is possible to enter a single command on one server and have it launch the code on all the servers simultaneously and automatically.
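To illustrate what I have in mind, here is a minimal sketch that builds the per-node command and dispatches it over SSH from one machine. The hostnames (`node0`, `node1`), the port, and the dataset path are hypothetical placeholders, and it assumes passwordless SSH is set up between the nodes:

```python
# Sketch: build and dispatch the per-node launch commands from one machine.
# Hostnames, port, and dataset path below are hypothetical examples.
import subprocess

NODES = ["node0", "node1"]       # hypothetical hostnames; node0 hosts rank 0
MASTER = "tcp://node0:23456"     # the FREEPORT from the guide, chosen arbitrarily
DATA = "/data/imagenet"          # hypothetical imagenet-folder with train/val


def launch_cmd(rank: int) -> str:
    """Return the main.py command line for the node with the given rank."""
    return (
        f"python main.py -a resnet50 --dist-url '{MASTER}' "
        f"--dist-backend 'nccl' --multiprocessing-distributed "
        f"--world-size {len(NODES)} --rank {rank} {DATA}"
    )


def launch_all(dry_run: bool = True):
    """SSH into each node and start its command; dry_run only prints them."""
    procs = []
    for rank, host in enumerate(NODES):
        cmd = ["ssh", host, launch_cmd(rank)]
        if dry_run:
            print(" ".join(cmd))
        else:
            procs.append(subprocess.Popen(cmd))
    return procs
```

With `dry_run=False` this would start both node commands from the one machine and return the `Popen` handles so they can be waited on. Is something along these lines the recommended approach, or is there built-in tooling for it?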
Thank you!