Running multiple versions of DistributedDataParallel in bash script

btil · December 22, 2020, 8:34pm

I am running a hyperparameter search and want to utilize all GPUs with DistributedDataParallel. Let’s say all my distributed training code is in main.py. I want to test different hyperparameters in a bash script like this, so that I can train both models at once:

python main.py --param1 45 --param2 20 &
python main.py --param1 33 --param2 41

Should I expect any funny business from DistributedDataParallel training like this?

mrshenli · December 27, 2020, 5:08am

I'm assuming your main.py script assigns a different MASTER_PORT to each experiment (so that the processes in the same experiment can successfully rendezvous).
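
One way to give each experiment its own MASTER_PORT is to set it from a small launcher rather than inside main.py. The sketch below is only an illustration: the helper names `find_free_ports` and `experiment_env` are made up here, and it assumes main.py initializes the process group with the default env:// method, so that it reads MASTER_ADDR and MASTER_PORT from the environment and does not overwrite them.

```python
import os
import socket


def find_free_ports(n):
    """Return n distinct TCP ports that are currently free on localhost.

    All sockets are held open until every port has been chosen, so the
    returned ports cannot collide with each other. Another program could
    still grab a port after the sockets are closed, so this is best effort.
    """
    socks = []
    try:
        for _ in range(n):
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            socks.append(s)
            s.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        return [s.getsockname()[1] for s in socks]
    finally:
        for s in socks:
            s.close()


def experiment_env(port):
    """Copy the current environment and set the rendezvous address for one experiment.

    Assumption: main.py uses env:// initialization and therefore reads
    MASTER_ADDR and MASTER_PORT from the environment.
    """
    env = dict(os.environ)
    env["MASTER_ADDR"] = "127.0.0.1"
    env["MASTER_PORT"] = str(port)
    return env
```

Each environment can then be passed to one experiment, for example `subprocess.Popen([sys.executable, "main.py", "--param1", "45", "--param2", "20"], env=experiment_env(ports[0]))`, with the second experiment using `ports[1]`.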

It should work with the Gloo backend. With the NCCL backend, however, it could hang, because NCCL requires using one communicator per device at a time. If multiple processes operate on the same CUDA device, DDP instances in different processes might launch AllReduce concurrently. If you are lucky and the experiments do not hang, the results should still be correct, I think.
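
A possible way to avoid two experiments sharing a CUDA device is to give each one a disjoint subset of the GPUs through CUDA_VISIBLE_DEVICES, at the cost of each experiment using only part of the machine. The sketch below is an assumption about how this could be set up, not something confirmed in this thread: `split_gpus` and `env_for_gpus` are hypothetical helpers, and it assumes main.py decides how many processes to start from the devices it can see (for example via `torch.cuda.device_count()`).

```python
import os


def split_gpus(gpu_ids, n_experiments):
    """Split a list of GPU ids into n_experiments disjoint contiguous groups.

    Groups differ in size by at most one; earlier groups get the extras.
    """
    gpu_ids = list(gpu_ids)
    if n_experiments < 1 or n_experiments > len(gpu_ids):
        raise ValueError("need between 1 and len(gpu_ids) experiments")
    base, extra = divmod(len(gpu_ids), n_experiments)
    groups, start = [], 0
    for i in range(n_experiments):
        size = base + (1 if i < extra else 0)
        groups.append(gpu_ids[start:start + size])
        start += size
    return groups


def env_for_gpus(gpus):
    """Copy the current environment and restrict the visible CUDA devices.

    Inside the child process the visible devices are renumbered from 0,
    so main.py does not need to know the physical GPU ids.
    """
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpus)
    return env
```

With four GPUs and two experiments, `split_gpus([0, 1, 2, 3], 2)` gives one experiment GPUs 0 and 1 and the other GPUs 2 and 3, and each resulting environment can be passed as `env=` when launching main.py.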