I am hyperparameter searching and want to utilize all GPU’s with DistributedDataParallel. Let’s say all my distributed training code is in main.py. I want to test different hyperparmeters in a bash script like this so that I can train both models at once:
python main.py --param1 45 --param2 20 &
python main.py --param1 33 --param2 41
Should I expect any funny business from DistributedDataParallel training like this?
main.py script assigns different
MASTER_PORT for different experiments (so that processes in the sample experiment can successfully rendezvous).
It should work with GLOO backend. But if you are using NCCL backend, it could hang, because NCCL requires using one communicator per device at a time. If there are multiple processes operating on the same CUDA device, DDP instances in different processes might launch
AllReduce concurrently. If you are lucky and the experiments do not hang, the result should still be correct I think.