BatchNorm for multi-GPU Training

Hi,
I read that PyTorch does not support the so-called sync BatchNorm, which is needed to
train on multi-GPU machines. My question is: are there any plans to implement sync BatchNorm
for PyTorch, and when will it be released?

Another question: what is the best workaround when you want to train with images and need
large batch sizes?

Thanks
Philip

SyncBatchNorm is already in PyTorch.
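
For example, it can be used as a drop-in counterpart of the regular BatchNorm layers. A minimal sketch (the layer sizes are arbitrary):

    import torch.nn as nn

    # nn.SyncBatchNorm takes the number of features, just like nn.BatchNorm2d
    model = nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=3, padding=1),
        nn.SyncBatchNorm(64),
        nn.ReLU(),
    )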

Hi @ptrblck ,
thanks for the answer.
The documentation says:

Currently SyncBatchNorm only supports DistributedDataParallel with single GPU per process.

This “single GPU” part confuses me a bit. What does it mean?

I am also asking because detectron2 still uses “FrozenBatchNorm2d”: https://github.com/facebookresearch/detectron2/blob/master/detectron2/modeling/backbone/resnet.py#L50

DistributedDataParallel can be used in two different setups as given in the docs.

  1. Single-Process Multi-GPU and
  2. Multi-Process Single-GPU, which is the fastest and recommended way.

SyncBatchNorm will only work in the second approach.
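
A rough sketch of the second setup, assuming the script is launched with one process per GPU (e.g. via a launcher that passes the local rank and sets the usual rendezvous environment variables):

    import torch
    import torch.nn as nn
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def setup_model(local_rank):
        # One process per GPU: each process joins the same process group.
        dist.init_process_group(backend="nccl")
        torch.cuda.set_device(local_rank)

        model = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.BatchNorm2d(8),
            nn.ReLU(),
        )
        # Convert the BatchNorm layers to SyncBatchNorm before wrapping in DDP.
        model = nn.SyncBatchNorm.convert_sync_batchnorm(model).cuda(local_rank)

        # Exactly one device id per process.
        return DDP(model, device_ids=[local_rank])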

I’m not sure if you would need SyncBatchNorm, since FrozenBatchNorm seems to fix all buffers:

BatchNorm2d where the batch statistics and the affine parameters are fixed.
It contains non-trainable buffers called
“weight” and “bias”, “running_mean”, “running_var”,
initialized to perform identity transformation.
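
In other words, a frozen BatchNorm roughly boils down to something like this (a simplified sketch for illustration, not the actual detectron2 implementation):

    import torch
    import torch.nn as nn

    class FrozenBatchNorm2dSketch(nn.Module):
        def __init__(self, num_features, eps=1e-5):
            super().__init__()
            self.eps = eps
            # Everything is a buffer, so nothing is trained or updated from batch stats.
            # Initialized so the layer starts as an identity transformation.
            self.register_buffer("weight", torch.ones(num_features))
            self.register_buffer("bias", torch.zeros(num_features))
            self.register_buffer("running_mean", torch.zeros(num_features))
            self.register_buffer("running_var", torch.ones(num_features))

        def forward(self, x):
            # Same formula as BatchNorm2d in eval mode, using the frozen buffers.
            scale = self.weight / (self.running_var + self.eps).sqrt()
            shift = self.bias - self.running_mean * scale
            return x * scale.view(1, -1, 1, 1) + shift.view(1, -1, 1, 1)

Since the statistics are never computed from the current batch, there is nothing to synchronize across GPUs.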

Hi @ptrblck,

How do I create my DDP model if I’m working on a cluster with multiple nodes and each node may have multiple GPUs?

I think this tutorial might be a good introduction to the different backends etc.
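
As a rough sketch, the usual pattern is to launch one process per GPU on every node and let the launcher set the rendezvous environment variables (the hostnames, ports, and script name below are just placeholders):

    # Launch the same script on every node, e.g. with torchrun:
    #   torchrun --nnodes=2 --nproc_per_node=4 --node_rank=0 \
    #            --master_addr=node0.example.com --master_port=29500 train.py
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # The launcher exports RANK, WORLD_SIZE, and LOCAL_RANK for each process.
        local_rank = int(os.environ["LOCAL_RANK"])
        dist.init_process_group(backend="nccl")
        torch.cuda.set_device(local_rank)

        model = torch.nn.Linear(10, 10).cuda(local_rank)
        ddp_model = DDP(model, device_ids=[local_rank])
        # ... usual training loop ...

    if __name__ == "__main__":
        main()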

Hi, I want to know the difference between SyncBN and BN.

SyncBatchNorm synchronizes the statistics during training in a DistributedDataParallel setup as given in the docs and can optionally be used.
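
As a toy single-process illustration of what that means for the statistics (the tensor shapes are arbitrary): with plain BatchNorm each GPU would normalize its own chunk of the batch with local statistics, while SyncBatchNorm reduces them so every GPU effectively sees the statistics of the full global batch.

    import torch

    # A "global" batch of 8 samples, split across two hypothetical GPUs.
    full_batch = torch.randn(8, 4)
    chunk_a, chunk_b = full_batch.chunk(2, dim=0)

    # Plain BatchNorm: each process uses its own local mean/var.
    print(chunk_a.mean(dim=0), chunk_b.mean(dim=0))

    # SyncBatchNorm: the statistics are computed over the whole global batch.
    print(full_batch.mean(dim=0))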