Why would SyncBatchNorm give different results from BatchNorm?

I have a model that reliably trains to a certain performance without DDP at a batch size of 2n. When I enable DDP, call SyncBatchNorm.convert_sync_batchnorm, use DistributedSampler, reduce the per-GPU batch size to n, and train on two GPUs, I get significantly worse results. To find out why, I checked whether SyncBatchNorm gives the same results as BatchNorm, and found that it doesn’t. What should I make of this?
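For reference, the setup described above can be sketched roughly as follows. This is a minimal illustration, not the actual training script: the toy model, the dataset, and the concrete batch size are hypothetical. `num_replicas` and `rank` are passed explicitly so the snippet runs in a single process without an initialized process group.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Hypothetical toy model and dataset, used only for illustration.
model = nn.Sequential(nn.Conv1d(2, 8, 3), nn.BatchNorm1d(8), nn.ReLU())
dataset = TensorDataset(torch.randn(64, 2, 16))

# Swap every BatchNorm*d layer for SyncBatchNorm before wrapping in DDP.
sync_model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

world_size = 2
global_batch = 16                            # the "2n" of the single-GPU run
per_rank_batch = global_batch // world_size  # the "n" used on each GPU

# One loader per rank; in a real DDP job each process builds only its own.
loaders = []
for rank in range(world_size):
    sampler = DistributedSampler(
        dataset, num_replicas=world_size, rank=rank, shuffle=False
    )
    loaders.append(DataLoader(dataset, batch_size=per_rank_batch, sampler=sampler))

print(type(sync_model[1]).__name__)            # class of the converted layer
print([len(loader.sampler) for loader in loaders])  # samples seen by each rank
```

In an actual run, each process would additionally wrap `sync_model` in DistributedDataParallel and, when shuffling, call `sampler.set_epoch(epoch)` at the start of every epoch.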

PyTorch 1.11 built from source, CUDA 11, RTX 3090s.


import os
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.multiprocessing as mp

from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

    # initialize the process group
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def demo_basic(rank, world_size):
    setup(rank, world_size)

    model = nn.SyncBatchNorm(2).to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    # each rank takes one sample of the (2, 2, 4) batch, keeping a batch dim of 1
    values = torch.load('values.pt')
    values = values.view(2, 1, 2, 4)
    values = values[rank].to(rank)

    out = ddp_model(values)
    torch.save(out.detach(), f'out{rank}.pt')

    cleanup()

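def run_ddp(demo_fn, world_size):
    # launch one process per GPU; each receives its rank as the first argument
    mp.spawn(demo_fn,
             args=(world_size,),
             nprocs=world_size,
             join=True)
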
if __name__ == "__main__":
    means = [1, 2, 3, 4]
    stds = [1, 2, 3, 4]

    values = [
        torch.normal(m, s, size=(1, 1, 4))
        for m, s in zip(means, stds)
    ]
    values = torch.cat(values, dim=1)  # (1, 4, 4)
    values = values.view(2, 2, 4)

    model = nn.BatchNorm1d(2).cuda()
    out = model(values.cuda())
    torch.save(values, 'values.pt')

    run_ddp(demo_basic, 2)

    out = out.detach().cpu()
    out_rank0 = torch.load('out0.pt', map_location={'cuda:0': 'cpu'})
    out_rank1 = torch.load('out1.pt', map_location={'cuda:1': 'cpu'})

    out_ddp = torch.cat((out_rank0, out_rank1), dim=0)

    print(out)
    print(out_ddp)

    diff = out - out_ddp
    print(diff)


tensor([[[-0.5313, -0.8621, -0.7267, -0.8753],
         [ 0.1433, -1.0897,  1.1616,  1.2525]],

        [[ 1.0406,  1.7617,  0.9555, -0.7623],
         [-1.6836, -0.1252, -0.5245,  0.8654]]])
tensor([[[-0.5313, -0.8621, -0.7267, -0.8753],
         [ 0.1433, -1.0897,  1.1616,  1.2525]],

        [[ 1.0406,  1.7617,  0.9555, -0.7623],
         [-1.6836, -0.1252, -0.5245,  0.8654]]])
tensor([[[-5.9605e-08, -5.9605e-08, -5.9605e-08, -5.9605e-08],
         [ 7.4506e-08, -1.1921e-07,  1.1921e-07,  2.3842e-07]],

        [[-1.1921e-07,  1.1921e-07, -1.1921e-07, -5.9605e-08],
         [-1.1921e-07,  2.9802e-08,  0.0000e+00,  1.7881e-07]]])

Notice the repeated values in the difference? Is there some numerical imprecision happening here? I ask as someone who was recently burned by TensorFloat32 silently doing reduced-precision matrix multiplies.
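One way to judge the size of these differences is to express each one in units of float32 spacing (ULPs) at the corresponding output value. The sketch below uses only the numbers printed above, which are rounded, so the ratios are approximate; `float32_ulp` is a hypothetical helper written for this illustration, not a PyTorch function.

```python
import math

def float32_ulp(x: float) -> float:
    """Spacing between adjacent float32 values near x (normal range only)."""
    # frexp gives abs(x) = m * 2**e with 0.5 <= m < 1; float32 has a 24-bit
    # significand, so neighbouring values are 2**(e - 24) apart.
    _, e = math.frexp(abs(x))
    return 2.0 ** (e - 24)

# Values copied from the printed BatchNorm output and the printed difference.
outputs = [
    -0.5313, -0.8621, -0.7267, -0.8753,
     0.1433, -1.0897,  1.1616,  1.2525,
     1.0406,  1.7617,  0.9555, -0.7623,
    -1.6836, -0.1252, -0.5245,  0.8654,
]
diffs = [
    -5.9605e-08, -5.9605e-08, -5.9605e-08, -5.9605e-08,
     7.4506e-08, -1.1921e-07,  1.1921e-07,  2.3842e-07,
    -1.1921e-07,  1.1921e-07, -1.1921e-07, -5.9605e-08,
    -1.1921e-07,  2.9802e-08,  0.0000e+00,  1.7881e-07,
]

ulps = [abs(d) / float32_ulp(o) for o, d in zip(outputs, diffs)]
for o, d, u in zip(outputs, diffs, ulps):
    print(f"out={o: .4f}  diff={d: .4e}  ~{u:.2f} ULP")

max_ulps = max(ulps)
print(f"largest difference: about {max_ulps:.2f} ULP")
```

The repeated values are small multiples of powers of two (5.9605e-08 is 2^-24, 1.1921e-07 is 2^-23), and every difference is within a few ULPs of the output. That is consistent with ordinary float32 rounding from a different order of operations, though this check alone does not prove that is the cause.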

Regarding the worse results, could you try setting:

torch.backends.cudnn.enabled = False

Per a few resources, such as the forum thread "Training performance degrades with DistributedDataParallel" (post #32 by dabs), this appears to help with accuracy and convergence issues.

Furthermore, the cuDNN backend is known to be nondeterministic; see, for example, the GitHub issue "Batchnorm gives different results depending on whether cudnn is enabled" (pytorch/pytorch issue #8283). Could you also try setting torch.backends.cudnn.deterministic = True, to see whether that results in equivalent outputs?
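A quick way to test this on a single GPU might look like the sketch below, which runs the same BatchNorm layer once with cuDNN enabled (in deterministic mode) and once with it disabled, then compares the outputs. The input is a hypothetical random tensor shaped like `values`. Whether cuDNN is actually selected for this layer depends on the input and the build, and on a machine without a GPU the snippet falls back to CPU, where both runs take the same code path.

```python
import torch
import torch.nn as nn

# Fall back to CPU so the sketch still runs without a GPU; in that case cuDNN
# is never used and the two outputs are trivially identical.
device = "cuda" if torch.cuda.is_available() else "cpu"

torch.manual_seed(0)
x = torch.randn(2, 2, 4, device=device)  # hypothetical input, shaped like `values`
bn = nn.BatchNorm1d(2).to(device)        # training mode: batch statistics are used

# torch.backends.cudnn.flags temporarily overrides the global cuDNN settings.
with torch.backends.cudnn.flags(enabled=True, deterministic=True):
    out_cudnn = bn(x)
with torch.backends.cudnn.flags(enabled=False):
    out_native = bn(x)

max_abs_diff = (out_cudnn - out_native).abs().max().item()
print(f"max |cudnn - native| = {max_abs_diff:.3e}")
```

If the two outputs agree to within float32 rounding, the backend is unlikely to explain a large gap in training results, and the data pipeline or effective batch size would be the next things to check.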