Multi-GPU training behaves differently from single-GPU training

Hello everyone,

I'm using the Residual Flows repository to train the network on the CIFAR-10 dataset. However, when I use 4 GPUs to speed up training, the generated fake images are very noisy. The network doesn't seem to train in this setup, whereas with 1 GPU the generated images are realistic.

One GPU:

Found 1 CUDA devices.
GeForce RTX 2080 Ti Memory: 10.76GB
Current LR 0.001
/home/hkeshvarik/residual-flows/lib/optimizers.py:88: UserWarning: This overload of add_ is deprecated:
add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
add_(Tensor other, *, Number alpha) (Triggered internally at …/torch/csrc/utils/python_arg_parser.cpp:1050.)
exp_avg.mul_(beta1).add_(1 - beta1, grad)
Epoch: [0][0/3125] | Time 3.300 | GradNorm 0.78 | Bits/dim 7.8337(7.8337) | Logpz -4263 | -DeltaLogp 4617 | EstMoment (1,20)
Epoch: [0][20/3125] | Time 1.873 | GradNorm 1.13 | Bits/dim 7.7749(7.8731) | Logpz -4344 | -DeltaLogp 4615 | EstMoment (1,19)
Epoch: [0][40/3125] | Time 1.814 | GradNorm 1.54 | Bits/dim 7.4018(7.7304) | Logpz -4063 | -DeltaLogp 4637 | EstMoment (17,46)
Epoch: [0][60/3125] | Time 1.856 | GradNorm 2.05 | Bits/dim 7.1400(7.5244) | Logpz -3775 | -DeltaLogp 4787 | EstMoment (84,475)
Epoch: [0][80/3125] | Time 1.886 | GradNorm 2.37 | Bits/dim 6.7452(7.2833) | Logpz -3603 | -DeltaLogp 5129 | EstMoment (255,3357)
Epoch: [0][100/3125] | Time 1.878 | GradNorm 2.57 | Bits/dim 6.3381(6.9521) | Logpz -3506 | -DeltaLogp 5738 | EstMoment (640,18912)
Epoch: [0][120/3125] | Time 1.948 | GradNorm 2.50 | Bits/dim 5.6473(6.4880) | Logpz -3546 | -DeltaLogp 6766 | EstMoment (1397,81412)
Epoch: [0][140/3125] | Time 2.003 | GradNorm 3.00 | Bits/dim 5.2198(5.9421) | Logpz -3730 | -DeltaLogp 8112 | EstMoment (2454,220792)
Epoch: [0][160/3125] | Time 1.907 | GradNorm 3.30 | Bits/dim 4.7968(5.5149) | Logpz -3933 | -DeltaLogp 9224 | EstMoment (3331,365044)
Epoch: [0][180/3125] | Time 1.906 | GradNorm 2.90 | Bits/dim 4.6803(5.2312) | Logpz -4101 | -DeltaLogp 9997 | EstMoment (3929,477682)
Epoch: [0][200/3125] | Time 1.966 | GradNorm 2.49 | Bits/dim 5.0990(5.0599) | Logpz -4239 | -DeltaLogp 10499 | EstMoment (4277,546810)

Four GPUs:

Found 4 CUDA devices.
GeForce RTX 2080 Ti Memory: 10.76GB
GeForce RTX 2080 Ti Memory: 10.76GB
GeForce RTX 2080 Ti Memory: 10.76GB
GeForce RTX 2080 Ti Memory: 10.76GB

Current LR 0.001
/home/hkeshvarik/residual-flows/lib/optimizers.py:88: UserWarning: This overload of add_ is deprecated:
add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
add_(Tensor other, *, Number alpha) (Triggered internally at …/torch/csrc/utils/python_arg_parser.cpp:1050.)
exp_avg.mul_(beta1).add_(1 - beta1, grad)
Epoch: [0][0/782] | Time 17.099 | GradNorm 0.25 | Bits/dim 7.8317(7.8317) | Logpz -4263 | -DeltaLogp 4621 | EstMoment (-0,14)
Epoch: [0][20/782] | Time 2.413 | GradNorm 0.59 | Bits/dim 8.0066(7.9317) | Logpz -4458 | -DeltaLogp 4603 | EstMoment (-1,15)
Epoch: [0][40/782] | Time 2.473 | GradNorm 0.70 | Bits/dim 8.0849(7.9832) | Logpz -4525 | -DeltaLogp 4561 | EstMoment (-1,15)
Epoch: [0][60/782] | Time 2.451 | GradNorm 0.63 | Bits/dim 7.9804(8.0029) | Logpz -4509 | -DeltaLogp 4503 | EstMoment (-1,15)
Epoch: [0][80/782] | Time 2.384 | GradNorm 0.57 | Bits/dim 8.0048(8.0065) | Logpz -4472 | -DeltaLogp 4458 | EstMoment (-1,15)
Epoch: [0][100/782] | Time 2.334 | GradNorm 0.58 | Bits/dim 8.0372(8.0153) | Logpz -4486 | -DeltaLogp 4454 | EstMoment (-1,15)

Any suggestions?

It’s a bit unclear what setup you are using, so here are some guesses.
Are you using nn.DataParallel or DistributedDataParallel? In the latter case, are you using SyncBatchNorm layers? Is the per-device batch size the same in the single-GPU run vs. the multi-GPU one? If so, you are basically scaling up the global batch size by 4x and shouldn’t expect to see the same results.

In the code, they use nn.DataParallel, and they don’t use SyncBatchNorm layers. I use a batch size of 16 for both configurations. The printed losses aren’t what matters; I care about the generated fake samples, which indicate the model’s capability.
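Note that nn.DataParallel scatters each input batch across the devices, so with the same loader batch size of 16, each of the 4 replicas only sees 4 samples per forward pass. A quick sanity check of the arithmetic:

```python
loader_batch = 16  # batch size passed to the DataLoader in both runs
n_gpus = 4

# nn.DataParallel splits the batch across devices, so each replica
# receives loader_batch / n_gpus samples per forward pass.
per_device = loader_batch // n_gpus
print(per_device)  # 4

# To give each GPU the same 16 samples as the single-GPU run, the
# loader batch size would have to be scaled by the device count.
scaled_batch = loader_batch * n_gpus
print(scaled_batch)  # 64
```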


model = ResidualFlow(
    input_size,
    n_blocks=list(map(int, args.nblocks.split('-'))),
    intermediate_dim=args.idim,
    factor_out=args.factor_out,
    quadratic=args.quadratic,
    init_layer=init_layer,
    actnorm=args.actnorm,
    fc_actnorm=args.fc_actnorm,
    batchnorm=args.batchnorm,
    dropout=args.dropout,
    fc=args.fc,
    coeff=args.coeff,
    vnorms=args.vnorms,
    n_lipschitz_iters=args.n_lipschitz_iters,
    sn_atol=args.sn_tol,
    sn_rtol=args.sn_tol,
    n_power_series=args.n_power_series,
    n_dist=args.n_dist,
    n_samples=args.n_samples,
    kernels=args.kernels,
    activation_fn=args.act,
    fc_end=args.fc_end,
    fc_idim=args.fc_idim,
    n_exact_terms=args.n_exact_terms,
    preact=args.preact,
    neumann_grad=args.neumann_grad,
    grad_in_forward=args.mem_eff,
    first_resblock=args.first_resblock,
    learn_p=args.learn_p,
    classification=args.task in ['classification', 'hybrid'],
    classification_hdim=args.cdim,
    n_classes=n_classes,
    block_type=args.block,
)

model.to(device)
ema = utils.ExponentialMovingAverage(model)


def parallelize(model):
    return torch.nn.DataParallel(model)


def train(epoch, model):

    model = parallelize(model)
    model.train()
    # ...
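As a side note, since parallelize is called inside train, the model gets re-wrapped in a fresh nn.DataParallel on every epoch. A common pattern is to wrap once, before the epoch loop; a minimal sketch with a toy model and placeholder loop:

```python
import torch
import torch.nn as nn

# Toy stand-in for the real model.
model = nn.Linear(8, 2)

# Wrap once, before the epoch loop, and only if multiple GPUs are present.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

def train(epoch, model):
    model.train()
    # ... forward / backward / optimizer step as in the original script ...

for epoch in range(2):
    train(epoch, model)

print(isinstance(model, nn.Module))  # True
```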

Multi-GPU result:

[image: e000_i001000 @ 4 GPUs]

The middle row shows the generated fake images!