I am going to train my model on multiple servers (N servers), each of which has 8 GPUs, i.e. 8*N GPUs in total.
I have checked the code provided by a tutorial that uses distributed training to train a model on ImageNet ( https://github.com/pytorch/examples/tree/master/imagenet ).
I found that I need to run the training code on each server separately, just as the guide describes:
Multiple nodes:
Node 0:
python main.py -a resnet50 --dist-url 'tcp://IP_OF_NODE0:FREEPORT' --dist-backend 'nccl' --multiprocessing-distributed --world-size 2 --rank 0 [imagenet-folder with train and val folders]
Node 1:
python main.py -a resnet50 --dist-url 'tcp://IP_OF_NODE0:FREEPORT' --dist-backend 'nccl' --multiprocessing-distributed --world-size 2 --rank 1 [imagenet-folder with train and val folders]
I am wondering whether it is possible to enter a single command on one server and have it launch the code on all the servers simultaneously and automatically.
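To illustrate what I have in mind, here is a minimal sketch that builds the per-node command and dispatches it over SSH from one machine. The hostnames (`node0`, `node1`), the port, and the dataset path are hypothetical placeholders, and it assumes passwordless SSH is set up between the nodes:

```python
# Sketch: build and dispatch the per-node launch commands from one machine.
# Hostnames, port, and dataset path below are hypothetical examples.
import subprocess

NODES = ["node0", "node1"]       # hypothetical hostnames; node0 hosts rank 0
MASTER = "tcp://node0:23456"     # the FREEPORT from the guide, chosen arbitrarily
DATA = "/data/imagenet"          # hypothetical imagenet-folder with train/val


def launch_cmd(rank: int) -> str:
    """Return the main.py command line for the node with the given rank."""
    return (
        f"python main.py -a resnet50 --dist-url '{MASTER}' "
        f"--dist-backend 'nccl' --multiprocessing-distributed "
        f"--world-size {len(NODES)} --rank {rank} {DATA}"
    )


def launch_all(dry_run: bool = True):
    """SSH into each node and start its command; dry_run only prints them."""
    procs = []
    for rank, host in enumerate(NODES):
        cmd = ["ssh", host, launch_cmd(rank)]
        if dry_run:
            print(" ".join(cmd))
        else:
            procs.append(subprocess.Popen(cmd))
    return procs
```

With `dry_run=False` this would start both node commands from the one machine and return the `Popen` handles so they can be waited on. Is something along these lines the recommended approach, or is there built-in tooling for it?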
Thank you!