Hi, I am curious why require_forward_param_sync is set to True in DistributedDataParallel. After I manually set it to False, multi-GPU training on one node speeds up a lot. Since the gradients of each replica have already been synchronized, why do we need to synchronize the parameters in the forward pass?
One of the things this flag controls is whether to broadcast model buffers in each iteration, to ensure that module buffers stay synchronized across processes. You can evaluate the speedup of your model with broadcast_buffers=False and keep it disabled if model accuracy permits.
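A minimal sketch of what that looks like, using a hypothetical toy model and a single-process gloo group purely for illustration (a real job would launch one process per GPU, e.g. with torchrun):

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn

# Single-process process group just so DDP can be constructed here;
# the address/port values are placeholders for this sketch.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.Linear(128, 10)  # stand-in for your real model

# broadcast_buffers=False skips broadcasting module buffers
# (e.g. BatchNorm running stats) from rank 0 on every forward pass.
ddp_model = nn.parallel.DistributedDataParallel(
    model,
    broadcast_buffers=False,
)

out = ddp_model(torch.randn(4, 128))
```

If your buffers really never change after initialization, skipping the per-iteration broadcast should not affect results, since every rank already holds the same values.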
Thanks for your reply; setting broadcast_buffers to False indeed achieves the same speedup. So broadcast_buffers can be safely set to False if any one of the following holds:
(1) there is no buffer in the model;
(2) the buffers are never updated.
Is this correct?
Are you sure there are no buffers in your module? If so, it is surprising that you're seeing a speedup, because the synchronization would be a no-op (there are no buffers to synchronize).
Can you confirm this by checking named_buffers() on the nn.Module you're passing into DDP?
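For example, with a toy module (the `Net` class and its `mask` buffer below are made up for illustration), iterating over named_buffers() lists every registered buffer:

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 8)
        # Buffers are part of module state but receive no gradients;
        # DDP broadcasts them each forward unless broadcast_buffers=False.
        self.register_buffer("mask", torch.ones(8))

model = Net()
for name, buf in model.named_buffers():
    print(name, tuple(buf.shape))
# prints: mask (8,)
```

A model with no registered buffers would print nothing here, which would mean the buffer broadcast is indeed a no-op.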
There are some small buffers in my model, but they are just constant matrices; the synchronization might still take some time.
Another interesting speedup also confuses me. In the Megatron-LM package (GitHub: NVIDIA/Megatron-LM), the model is explicitly converted to float16 by calling something like model.half() before training with fp16 enabled, and doing so largely reduces the backward time (i.e. the time it takes to finish loss.backward()). I would think the backward time should be agnostic to whether the model is float16, since the gradients are already fp16 when fp16 is enabled. Do you have any hint about why this may happen?
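For reference, a sketch of the conversion being described (the model architecture here is invented): model.half() casts every parameter and buffer to fp16 in place, and since autograd produces gradients in the parameters' dtype, the backward pass then runs half-precision kernels as well.

```python
import torch
import torch.nn as nn

# Stand-in model; Megatron-LM applies the same idea to its transformer.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
model.half()  # casts all parameters and buffers to torch.float16

assert all(p.dtype == torch.float16 for p in model.parameters())

# Gradients inherit the tensor's dtype: a small elementwise example
# (kept elementwise so it also runs in fp16 on CPU).
w = torch.randn(4, dtype=torch.float16, requires_grad=True)
(w * w).sum().backward()
assert w.grad.dtype == torch.float16
```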