DDP Model Parameters different from Single-process Model Parameters

I am currently running two versions of the same model (DDP and single GPU) and am trying to ensure the model parameters stay the same between the two. To do so, I load the same checkpoint into both models at the first epoch and run them for several iterations. Immediately after loading the checkpoint, the initial parameters match (DDP is running two processes in this example):

DDP Model Output
Rank: 0
model.state_dict():  OrderedDict([('module.module.input_networks.0.0.net.net.0.weight', tensor([[ 0.3889,  0.1429,  0.1801,  0.2079],
        [ 0.4926, -0.3095, -0.1581, -0.1851],
        [ 0.2123, -0.4082, -0.3036,  0.1350],
        [ 0.3666,  0.2588,  0.3510, -0.4564],
Rank: 1
OrderedDict([('module.module.input_networks.0.0.net.net.0.weight', tensor([[ 0.3889,  0.1429,  0.1801,  0.2079],
        [ 0.4926, -0.3095, -0.1581, -0.1851],
        [ 0.2123, -0.4082, -0.3036,  0.1350],
        [ 0.3666,  0.2588,  0.3510, -0.4564],

Single GPU Model Output
model.state_dict():  OrderedDict([('input_networks.0.0.net.net.0.weight', tensor([[ 0.3889,  0.1429,  0.1801,  0.2079],
        [ 0.4926, -0.3095, -0.1581, -0.1851],
        [ 0.2123, -0.4082, -0.3036,  0.1350],
        [ 0.3666,  0.2588,  0.3510, -0.4564],

However, once I run both for one iteration, the parameters differ (by roughly 0.001 to 0.002) and the outputs look like this:

DDP Model Output
Rank 0: 
model.state_dict():  OrderedDict([('input_networks.0.0.net.net.0.weight', tensor([[ 0.3886,  0.1431,  0.1805,  0.2080],
        [ 0.4927, -0.3096, -0.1583, -0.1849],
        [ 0.2124, -0.4080, -0.3040,  0.1352],
        [ 0.3663,  0.2588,  0.3514, -0.4563],
Rank 1: 
model.state_dict():  OrderedDict([('input_networks.0.0.net.net.0.weight', tensor([[ 0.3886,  0.1431,  0.1805,  0.2080],
        [ 0.4927, -0.3096, -0.1583, -0.1849],
        [ 0.2124, -0.4080, -0.3040,  0.1352],
        [ 0.3663,  0.2588,  0.3514, -0.4563],

Single GPU Model Output
model.state_dict():  OrderedDict([('module.module.input_networks.0.0.net.net.0.weight', tensor([[ 0.3888,  0.1430,  0.1802,  0.2078],
        [ 0.4925, -0.3094, -0.1582, -0.1850],
        [ 0.2124, -0.4081, -0.3037,  0.1351],
        [ 0.3665,  0.2587,  0.3511, -0.4565],

Does anyone have experience with why this might be happening? My model does not have any dropout layers, so it should be updating identically in both runs.
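
For completeness, here is a simplified sketch of the kind of comparison described above; the checkpoint paths and the module. prefix stripping are illustrative assumptions, not my exact code:

    import torch

    # Load both checkpoints onto the CPU for a direct comparison
    ddp_sd = torch.load("ddp_checkpoint.pt", map_location="cpu")      # placeholder path
    local_sd = torch.load("local_checkpoint.pt", map_location="cpu")  # placeholder path

    # DDP (and any extra wrapper) prepends "module." to every key, so strip it
    # before comparing against the single-GPU state_dict
    def strip_prefix(sd):
        return {k.replace("module.", ""): v for k, v in sd.items()}

    ddp_sd = strip_prefix(ddp_sd)
    for key, local_param in local_sd.items():
        max_diff = (ddp_sd[key] - local_param).abs().max().item()
        print(f"{key}: max abs diff = {max_diff:.6f}")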

Hi, for DDP versus local training, are the following true for your model?

  1. The learning rate is consistent across distributed and local training.
  2. If your batch size is M locally, then a per-worker batch size of M / N across N workers with a mean-reduced loss should result in training equivalent to the local run (see the sketch after this list).
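
A minimal sketch of what point 2 looks like in practice (the process group is assumed to be initialized already; M, dataset, and the DataLoader settings are example values, not taken from your code):

    import torch
    import torch.distributed as dist
    from torch.utils.data import DataLoader, DistributedSampler

    M = 64                                 # total batch size used in the local run (example value)
    world_size = dist.get_world_size()     # N workers
    per_worker_batch = M // world_size     # each DDP worker sees M / N samples per step

    dataset = torch.arange(1000).float()   # placeholder dataset
    sampler = DistributedSampler(dataset)  # shards the data across ranks without overlap
    loader = DataLoader(dataset, batch_size=per_worker_batch, sampler=sampler)

    # With a mean-reduced loss, DDP averages gradients across the N workers, so a
    # step over M / N samples per worker corresponds to a local step over all M
    # samples (up to floating-point differences in the reduction order).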

By “iteration”, are you referring to forward + backward + optimizer step? Before the optimizer step, it may also be useful to inspect the model gradients and confirm that they match across local and distributed training, along the lines of the sketch below.
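
For example, something like this just before the optimizer step, in both the local and the DDP run (model, loss, and optimizer are whatever you already have in your training loop):

    loss.backward()
    # Print per-parameter gradient summaries so the two runs can be compared rank by rank
    for name, param in model.named_parameters():
        if param.grad is not None:
            print(f"{name}: grad mean={param.grad.mean().item():.6f}, "
                  f"grad norm={param.grad.norm().item():.6f}")
    optimizer.step()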