DistributedDataParallel broadcast_buffers

I found this in the documentation:

     broadcast_buffers: flag that enables syncing (broadcasting) buffers of
                      the module at beginning of the forward function.
                      (default: True)

But what exactly does that mean? If I set it to False, from the code it looks like the gradients are still reduced across replicas, which should result in the same buffers on all the replicas on all the nodes anyway? Please let me know if I am wrong.
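
To make my assumption concrete, here is a toy single-process sketch (model_a and model_b are just stand-ins for two replicas; this is not DDP's actual implementation) of why I expected the replicas to stay identical once gradients are averaged:

    # Toy single-process sketch of my assumption (not DDP internals):
    # two copies of the same model, gradients averaged by hand,
    # identical optimizer steps -> the learnable parameters stay identical.
    import copy
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    model_a = nn.Linear(4, 2)
    model_b = copy.deepcopy(model_a)  # "replica" on another rank

    opt_a = torch.optim.SGD(model_a.parameters(), lr=0.1)
    opt_b = torch.optim.SGD(model_b.parameters(), lr=0.1)

    # Each replica sees a different input, as with a DistributedSampler.
    model_a(torch.randn(8, 4)).sum().backward()
    model_b(torch.randn(8, 4)).sum().backward()

    # Hand-rolled "allreduce": average gradients across the two replicas.
    for p_a, p_b in zip(model_a.parameters(), model_b.parameters()):
        avg = (p_a.grad + p_b.grad) / 2
        p_a.grad.copy_(avg)
        p_b.grad.copy_(avg)

    opt_a.step()
    opt_b.step()

    # Parameters are still identical on both "ranks".
    for p_a, p_b in zip(model_a.parameters(), model_b.parameters()):
        assert torch.equal(p_a, p_b)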


What you’re saying is true for learnable parameters, but some modules like BatchNorm keep track of statistics (e.g. running mean and variance) of the tensors that pass through them, and they store these statistics in buffers rather than parameters. In a multi-GPU setup, different GPUs receive different inputs, so these statistics diverge. It is therefore necessary to synchronize them, which is what the broadcast_buffers flag does.
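
To illustrate, here is a quick check you can run locally (not tied to DDP at all) showing that BatchNorm's running statistics live in buffers rather than parameters, so gradient reduction never updates them:

    # BatchNorm's running statistics are registered as buffers, not
    # parameters, so the gradient allreduce never touches them.
    import torch
    import torch.nn as nn

    bn = nn.BatchNorm2d(3)
    print([name for name, _ in bn.named_parameters()])
    # ['weight', 'bias']
    print([name for name, _ in bn.named_buffers()])
    # ['running_mean', 'running_var', 'num_batches_tracked']

    # A forward pass in training mode updates the buffers locally; on
    # different GPUs with different inputs they would drift apart
    # without broadcast_buffers=True.
    bn.train()
    bn(torch.randn(8, 3, 16, 16))
    print(bn.running_mean)  # no longer all zeros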

My guess is that this incurs a slight performance overhead, so if you’re not using such modules in your model you can disable it.
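
For what it's worth, a minimal sketch of turning it off for a buffer-free model (the single-process gloo initialization is only there so the snippet runs on its own; assume a real job is launched with torchrun or a similar launcher):

    import os
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Single-process gloo setup only so the snippet runs standalone;
    # a real job would be launched with torchrun or similar.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = nn.Sequential(nn.Linear(16, 8), nn.ReLU())  # no BatchNorm -> no buffers
    ddp_model = DDP(model, broadcast_buffers=False)  # skip the per-forward buffer sync

    print(list(ddp_model.module.buffers()))  # [] -> nothing to broadcast anyway
    dist.destroy_process_group()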


I think the documentation should mention exactly this information. That BatchNorm's running mean and variance are buffers, not parameters, may not be obvious at first.