PyTorch Distributed Data Parallel - How can I pass the same information to all processes?

I am using DistributedDataParallel, which instantiates multiple processes to train a model on multiple GPUs. I want to save each experiment run's backup to a single new folder (say, by passing the same timestamp to all processes). However, some processes are delayed by a second, which leads to different timestamps within the same experiment run. Is it possible to pass the same information (a single timestamp) to all the processes?

I start my script with:

-i \
-m torch.distributed.launch \
--master_port=9997 \
--nproc_per_node=8 \
.....


Does it work if you let the rank 0 process broadcast its timestamp to the other processes?
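A minimal sketch of that suggestion: rank 0 encodes its timestamp as a tensor and broadcasts it, so every process derives the same folder name. The helper names (`get_shared_timestamp`, `format_stamp`) are mine, not from this thread, and the sketch assumes a process group has already been initialized (e.g. via `torch.distributed.launch`):

```python
# Hypothetical helpers (not from the thread): rank 0 broadcasts its clock
# reading so every DDP process names the run folder identically.
import datetime

import torch
import torch.distributed as dist


def format_stamp(stamp):
    """Format a LongTensor [Y, M, D, h, m, s] as 'YYYY-MM-DD_hh:mm:ss'."""
    y, mo, d, h, mi, s = stamp.tolist()
    return f"{y:04d}-{mo:02d}-{d:02d}_{h:02d}:{mi:02d}:{s:02d}"


def get_shared_timestamp():
    """Return the same timestamp string on every rank (taken from rank 0's clock)."""
    if dist.get_rank() == 0:
        now = datetime.datetime.now()
        stamp = torch.tensor(
            [now.year, now.month, now.day, now.hour, now.minute, now.second],
            dtype=torch.long,
        )
    else:
        # Placeholder tensor; broadcast overwrites it in place with rank 0's values.
        stamp = torch.zeros(6, dtype=torch.long)
    dist.broadcast(stamp, src=0)
    return format_stamp(stamp)
```

Each rank can then build its backup path from the shared string, e.g. `os.path.join(backup_root, get_shared_timestamp())`, and all ranks will agree on the folder name even if they reach this code a second apart.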


Thanks a lot, that worked!
One question though: I broadcast the seconds value from the process with local rank 5 as a tensor, and afterwards all processes hold the same seconds value.
How does this work: is it possible for a faster process to read an old value of the variable before the src process has broadcast it, or do all processes wait until the value has been broadcast by the src process?

Current time on machine is : 3 2020-08-02_21:52:52
Current time on machine is : 4 2020-08-02_21:52:52
Current time on machine is : 2 2020-08-02_21:52:52
Current time on machine is : 5 2020-08-02_21:52:52
Current time on machine is : 7 2020-08-02_21:52:52
Current time on machine is : 6 2020-08-02_21:52:53
Current time on machine is : 1 2020-08-02_21:52:53
Current time on machine is : 0 2020-08-02_21:52:53

Before Broadcasting seconds: tensor([52], device='cuda:4')
Before Broadcasting seconds: tensor([52], device='cuda:3')
Before Broadcasting seconds: tensor([52], device='cuda:2')
Before Broadcasting seconds: tensor([52], device='cuda:7')
Before Broadcasting seconds: tensor([52], device='cuda:5')
Before Broadcasting seconds: tensor([53], device='cuda:1')
Before Broadcasting seconds: tensor([53], device='cuda:0')
Before Broadcasting seconds: tensor([53], device='cuda:6')

<broadcast using torch.distributed.broadcast(LongTensor(seconds), src=5)>

After Broadcasting seconds  tensor([52], device='cuda:6')
After Broadcasting seconds  tensor([52], device='cuda:1')
After Broadcasting seconds  tensor([52], device='cuda:7')
After Broadcasting seconds  tensor([52], device='cuda:2')
After Broadcasting seconds  tensor([52], device='cuda:0')
After Broadcasting seconds  tensor([52], device='cuda:5')
After Broadcasting seconds  tensor([52], device='cuda:4')
After Broadcasting seconds  tensor([52], device='cuda:3')

The broadcast API has an `async_op` argument, which defaults to `False`. When it is `False`, all processes block in the call until the broadcast has completed, so no process can observe a stale value. When it is `True`, `broadcast` is non-blocking and returns a work handle on which you can call `wait()`; in that case, the tensor is only guaranteed to hold the broadcast result after `wait()` returns.
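The two modes described above can be sketched as follows. This is an illustrative example, not code from the thread, and it assumes a process group is already initialized:

```python
# Sketch of blocking vs. non-blocking broadcast semantics
# (assumes dist.init_process_group(...) has already been called).
import torch
import torch.distributed as dist


def sync_broadcast(tensor, src):
    # Default async_op=False: every rank blocks here until the broadcast
    # completes, so no rank can read a stale value afterwards.
    dist.broadcast(tensor, src=src)
    return tensor


def async_broadcast(tensor, src):
    # async_op=True returns a work handle immediately; the tensor contents
    # are only guaranteed valid after wait() returns.
    work = dist.broadcast(tensor, src=src, async_op=True)
    # ... unrelated computation could overlap with communication here ...
    work.wait()
    return tensor
```

With `async_op=True` you can overlap communication with computation that does not touch the tensor, as long as you call `wait()` before reading it.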
