I have some questions about DataParallel broadcasting. Does it broadcast the model's parameters layer by layer, or does it gather all parameters of the whole model and broadcast them at once? I found some related C++ code in torch/csrc/cuda/comm.cpp.
I have two GPUs: GPU0, which is the main (source) GPU, and GPU1.
I added some logging to the Python side (torch/nn/parallel/comm.py), and the logs show that the parameters are broadcast layer by layer.
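To make the question concrete, here is a pure-Python sketch (my own simplification, not the real torch API; no GPUs needed) of the bucketed broadcast I believe `torch.cuda.comm.broadcast_coalesced` performs: parameters are coalesced into buffers up to a size limit, and each buffer is broadcast in turn, rather than the whole model going out in a single transfer.

```python
# Hypothetical stand-ins for the real CUDA broadcast primitives; here a
# "broadcast" is just a copy of a flat buffer to each destination device.

def broadcast(tensor, devices):
    # One transfer per destination device, modeled as a list copy.
    return [list(tensor) for _ in devices]

def broadcast_coalesced(tensors, devices, buffer_size):
    """Group tensors into buckets of at most `buffer_size` elements,
    then broadcast bucket by bucket (not the whole model at once)."""
    buckets, current, current_size = [], [], 0
    for t in tensors:
        if current and current_size + len(t) > buffer_size:
            buckets.append(current)
            current, current_size = [], 0
        current.append(t)
        current_size += len(t)
    if current:
        buckets.append(current)

    outputs = [[] for _ in devices]          # per-device parameter lists
    for bucket in buckets:
        flat = [x for t in bucket for x in t]   # coalesce the bucket
        copies = broadcast(flat, devices)       # one transfer per bucket
        for dev_idx, flat_copy in enumerate(copies):
            # Split the flat buffer back into per-tensor views.
            offset = 0
            for t in bucket:
                outputs[dev_idx].append(flat_copy[offset:offset + len(t)])
                offset += len(t)
    return outputs

params = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]   # three "layers" of parameters
replicas = broadcast_coalesced(params, devices=[0, 1], buffer_size=4)
```

With `buffer_size=4`, the first two "layers" share one transfer and the third gets its own, so each replica ends up with a full copy of `params` after two broadcasts instead of one model-sized transfer.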
If it broadcasts layer by layer, a follow-up question: does GPU1 receive the next layer's parameters from GPU0 while it is still computing on the current layer, i.e., does the transfer overlap with computation?
Thanks for your attention and answers!