DDP Model Parameters different from Single-process Model Parameters

I am currently running two versions of the same model (DDP and single GPU) and am trying to ensure the model parameters stay the same between the two. To do so, I load the same checkpoint into both models at the first epoch and run them for several iterations. Immediately after loading the checkpoint, the initial parameters match (DDP is running two processes in this example):

DDP Model Output
Rank: 0
model.state_dict():  OrderedDict([('module.module.input_networks.0.0.net.net.0.weight', tensor([[ 0.3889,  0.1429,  0.1801,  0.2079],
        [ 0.4926, -0.3095, -0.1581, -0.1851],
        [ 0.2123, -0.4082, -0.3036,  0.1350],
        [ 0.3666,  0.2588,  0.3510, -0.4564],
Rank: 1
OrderedDict([('module.module.input_networks.0.0.net.net.0.weight', tensor([[ 0.3889,  0.1429,  0.1801,  0.2079],
        [ 0.4926, -0.3095, -0.1581, -0.1851],
        [ 0.2123, -0.4082, -0.3036,  0.1350],
        [ 0.3666,  0.2588,  0.3510, -0.4564],

Single GPU Model Output
model.state_dict():  OrderedDict([('input_networks.0.0.net.net.0.weight', tensor([[ 0.3889,  0.1429,  0.1801,  0.2079],
        [ 0.4926, -0.3095, -0.1581, -0.1851],
        [ 0.2123, -0.4082, -0.3036,  0.1350],
        [ 0.3666,  0.2588,  0.3510, -0.4564],

However, once I run both for one iteration, the parameters differ (by roughly 0.001 to 0.002) and the outputs look like this:

DDP Model Output
Rank 0: 
model.state_dict():  OrderedDict([('input_networks.0.0.net.net.0.weight', tensor([[ 0.3886,  0.1431,  0.1805,  0.2080],
        [ 0.4927, -0.3096, -0.1583, -0.1849],
        [ 0.2124, -0.4080, -0.3040,  0.1352],
        [ 0.3663,  0.2588,  0.3514, -0.4563],
Rank 1: 
model.state_dict():  OrderedDict([('input_networks.0.0.net.net.0.weight', tensor([[ 0.3886,  0.1431,  0.1805,  0.2080],
        [ 0.4927, -0.3096, -0.1583, -0.1849],
        [ 0.2124, -0.4080, -0.3040,  0.1352],
        [ 0.3663,  0.2588,  0.3514, -0.4563],

Single GPU Model Output
model.state_dict():  OrderedDict([('module.module.input_networks.0.0.net.net.0.weight', tensor([[ 0.3888,  0.1430,  0.1802,  0.2078],
        [ 0.4925, -0.3094, -0.1582, -0.1850],
        [ 0.2124, -0.4081, -0.3037,  0.1351],
        [ 0.3665,  0.2587,  0.3511, -0.4565],

Does anyone have experience with why this might be happening? My model does not have any dropout layers, so it should be updating identically in both runs.
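
For completeness, here is a simplified sketch of the kind of comparison described above; the checkpoint paths and the module. prefix stripping are illustrative assumptions, not my exact code:

    import torch

    # Load both checkpoints onto the CPU for a direct comparison
    ddp_sd = torch.load("ddp_checkpoint.pt", map_location="cpu")      # placeholder path
    local_sd = torch.load("local_checkpoint.pt", map_location="cpu")  # placeholder path

    # DDP (and any extra wrapper) prepends "module." to every key, so strip it
    # before comparing against the single-GPU state_dict
    def strip_prefix(sd):
        return {k.replace("module.", ""): v for k, v in sd.items()}

    ddp_sd = strip_prefix(ddp_sd)
    for key, local_param in local_sd.items():
        max_diff = (ddp_sd[key] - local_param).abs().max().item()
        print(f"{key}: max abs diff = {max_diff:.6f}")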

Hi, for DDP versus local training, are the following true for your model?

  1. The learning rate is consistent across distributed and local training.
  2. If your batch size is M locally, then a per-worker batch size of M / N across N workers with a mean-reduced loss should result in training equivalent to the local run (see the sketch after this list).
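
A minimal sketch of what point 2 looks like in practice (the process group is assumed to be initialized already; M, dataset, and the DataLoader settings are example values, not taken from your code):

    import torch
    import torch.distributed as dist
    from torch.utils.data import DataLoader, DistributedSampler

    M = 64                                 # total batch size used in the local run (example value)
    world_size = dist.get_world_size()     # N workers
    per_worker_batch = M // world_size     # each DDP worker sees M / N samples per step

    dataset = torch.arange(1000).float()   # placeholder dataset
    sampler = DistributedSampler(dataset)  # shards the data across ranks without overlap
    loader = DataLoader(dataset, batch_size=per_worker_batch, sampler=sampler)

    # With a mean-reduced loss, DDP averages gradients across the N workers, so a
    # step over M / N samples per worker corresponds to a local step over all M
    # samples (up to floating-point differences in the reduction order).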

By “iteration”, are you referring to forward + backward + optimizer step? Before the optimizer step, it may also be useful to inspect the model gradients and confirm that they match across local and distributed training, along the lines of the sketch below.
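
For example, something like this just before the optimizer step, in both the local and the DDP run (model, loss, and optimizer are whatever you already have in your training loop):

    loss.backward()
    # Print per-parameter gradient summaries so the two runs can be compared rank by rank
    for name, param in model.named_parameters():
        if param.grad is not None:
            print(f"{name}: grad mean={param.grad.mean().item():.6f}, "
                  f"grad norm={param.grad.norm().item():.6f}")
    optimizer.step()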