I read that PyTorch does not support the so-called sync BatchNorm, which is needed to
train on multi-GPU machines. My question is: are there any plans to implement sync BatchNorm
for PyTorch, and when will it be released?
Another question: what is the best workaround when you want to train with images and need
large batch sizes?
SyncBatchNorm is already in PyTorch.
Hi @ptrblck ,
thanks for the answer.
The documentation says:
Currently SyncBatchNorm only supports DistributedDataParallel with single GPU per process.
This “single GPU” part confuses me. What does it mean?
I am also asking because detectron2 still uses “FrozenBatchNorm2d”: https://github.com/facebookresearch/detectron2/blob/master/detectron2/modeling/backbone/resnet.py#L50
DistributedDataParallel can be used in two different setups as given in the docs.
- Single-Process Multi-GPU and
- Multi-Process Single-GPU, which is the fastest and recommended way.
SyncBatchNorm will only work in the second approach.
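In practice this means you convert the model's BatchNorm layers before wrapping it in DistributedDataParallel (with one process per GPU). A minimal sketch using `torch.nn.SyncBatchNorm.convert_sync_batchnorm`, with a toy model just for illustration:

```python
import torch
import torch.nn as nn

# Toy model with a regular BatchNorm layer (illustrative only).
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
)

# Replace every BatchNorm layer with SyncBatchNorm.
# In a real run you would do this before wrapping the model in
# DistributedDataParallel, in a multi-process single-GPU setup.
sync_model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

# sync_model[1] is now a torch.nn.SyncBatchNorm layer.
```

The conversion itself runs anywhere; the synchronized statistics only take effect once the model runs inside an initialized process group.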
I’m not sure if you would need it, but FrozenBatchNorm seems to fix all buffers:

BatchNorm2d where the batch statistics and the affine parameters are fixed.
It contains non-trainable buffers called
“weight”, “bias”, “running_mean”, and “running_var”,
initialized to perform identity transformation.
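To make the “non-trainable buffers” point concrete, here is a simplified sketch of such a layer (not detectron2’s actual implementation, just the idea): everything is registered as a buffer, so nothing appears in `parameters()` and no running statistics are ever updated.

```python
import torch
import torch.nn as nn

class FrozenBatchNorm2d(nn.Module):
    """Simplified sketch: batch statistics and affine parameters are
    fixed buffers, so nothing is trained and stats never update."""

    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.eps = eps
        # Buffers (not nn.Parameter): excluded from the optimizer,
        # initialized so the layer is (almost) an identity transformation.
        self.register_buffer("weight", torch.ones(num_features))
        self.register_buffer("bias", torch.zeros(num_features))
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features))

    def forward(self, x):
        # Fold the frozen stats into a per-channel scale and shift.
        scale = self.weight * (self.running_var + self.eps).rsqrt()
        shift = self.bias - self.running_mean * scale
        return x * scale.view(1, -1, 1, 1) + shift.view(1, -1, 1, 1)
```

Since `weight`, `bias`, `running_mean`, and `running_var` are buffers, `list(module.parameters())` is empty, which is exactly why such a layer stays fixed during fine-tuning.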
How do I create my DDP model if I’m working on a cluster with multiple nodes and each node may have multiple GPUs?
I think this tutorial might be a good introduction to the different backends etc.
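As a rough sketch of the multi-node, multi-GPU case (one process per GPU, launched e.g. with `torchrun --nnodes=2 --nproc_per_node=4 train.py`): the launcher sets `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` in the environment, and each process pins itself to its local GPU. The model here is a placeholder, not from the thread above, and this needs an actual multi-GPU launch to run.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun / the env:// init method read RANK, WORLD_SIZE,
    # MASTER_ADDR, MASTER_PORT from the environment.
    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model for illustration.
    model = nn.Sequential(
        nn.Conv2d(3, 8, 3, padding=1),
        nn.BatchNorm2d(8),
        nn.ReLU(),
    ).cuda(local_rank)

    # Convert BatchNorm to SyncBatchNorm before wrapping in DDP,
    # then pin the DDP replica to this process's single GPU.
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = DDP(model, device_ids=[local_rank])

if __name__ == "__main__":
    main()
```

The `device_ids=[local_rank]` argument is what makes this the recommended multi-process single-GPU setup from the docs above.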
Hi, I want to know the difference between SyncBN and BN.
SyncBatchNorm synchronizes the statistics during training in a
DistributedDataParallel setup as given in the docs and can optionally be used.