broadcast_buffers: flag that enables syncing (broadcasting) buffers of
the module at beginning of the forward function.
(default: True)
But what exactly does that mean? If I set it to False, it looks from the code like gradients are still all-reduced, which should result in the same buffers on all the replicas on all the nodes? Please let me know if I am wrong.
What you’re saying is true for learnable parameters, but some modules, like BatchNorm, track statistics (e.g. running mean and variance) of the tensors that pass through them, and they store these in buffers rather than parameters. Gradient all-reduce never touches buffers. In a multi-GPU setup, different GPUs receive different inputs, so these statistics diverge across replicas; it is therefore necessary to synchronize them explicitly, which is what the broadcast_buffers flag does.
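A minimal sketch of the divergence (no actual DDP needed; two local BatchNorm layers stand in for the same module on two ranks):

```python
import torch
import torch.nn as nn

# Two identically initialized BatchNorm layers, standing in for the same
# module replicated on two GPUs.
bn0 = nn.BatchNorm1d(4)
bn1 = nn.BatchNorm1d(4)

# Each "replica" sees a different batch, as happens under DDP.
bn0(torch.randn(8, 4))
bn1(torch.randn(8, 4) + 5.0)  # shifted inputs -> different running stats

# The learnable parameters are still identical (no optimizer step has run,
# and gradients would be all-reduced anyway), but the buffers holding the
# running statistics have already diverged.
print(torch.allclose(bn0.weight, bn1.weight))              # True
print(torch.allclose(bn0.running_mean, bn1.running_mean))  # False
```

Without the broadcast, each replica would keep normalizing with its own drifting statistics at eval time.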
My guess is that the broadcast incurs a slight per-iteration communication overhead, so if your model contains no modules with buffers you can disable it.
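Disabling it is just a constructor argument. A runnable sketch, using a single-process "gloo" group purely so DDP can be constructed (the address/port values are arbitrary placeholders; real training uses multiple ranks):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process process group, just enough to instantiate DDP locally.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Sequential(torch.nn.Linear(8, 4), torch.nn.BatchNorm1d(4))
ddp = DDP(model, broadcast_buffers=False)  # skip the per-forward buffer sync

out = ddp(torch.randn(16, 8))
print(out.shape)

dist.destroy_process_group()
```

With broadcast_buffers=True (the default), rank 0's buffers are broadcast to all ranks at the start of each forward pass; setting it to False skips that collective.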