[DDP] Define optimizer before or after DDP (plus amp)

Hi, I have a question about the right order for defining the optimizer when using DDP together with amp.
Define the optimizer before DDP (torch.nn.parallel.DistributedDataParallel):

optimizer = torch.optim.AdamW(model.parameters(), lr=0.01)
model, optimizer = amp.initialize(model, optimizer, opt_level='O2')
model = DDP(model)

Define the optimizer after DDP:

model = DDP(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.01)
model, optimizer = amp.initialize(model, optimizer, opt_level='O2')

The official video classification script uses the first ordering.
Yet I have seen a blog post saying that, when using DDP, you should create the optimizer over the DDP-wrapped model (though that article did not involve amp).
I wonder which one is correct, or whether it doesn't matter and both are fine. Thanks.

DDP does not change model.parameters() (the wrapper exposes the same parameter tensors as the module it wraps), and the optimizer works entirely locally, so defining the optimizer before or after wrapping the model with DDP should not make a difference.
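If you want to check this yourself, here is a minimal sketch. It assumes a CPU-only, single-process "gloo" group (just so DDP can be constructed outside a real launch) and uses a small Linear layer as a hypothetical stand-in for your model. It leaves amp out entirely, so it says nothing about where amp.initialize should go; it only compares the parameter tensors held by an optimizer built before wrapping with those held by one built after.

```python
import os
import tempfile

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def optimizer_sees_same_parameters() -> bool:
    """Return True if optimizers built before and after DDP wrapping hold the same tensors."""
    # Single-process "gloo" group on CPU, only so DDP can be constructed here.
    # A real job would be launched with one process per device instead.
    init_file = os.path.join(tempfile.mkdtemp(), "ddp_init")
    dist.init_process_group(
        backend="gloo", init_method=f"file://{init_file}", rank=0, world_size=1
    )
    try:
        model = torch.nn.Linear(4, 2)  # hypothetical stand-in for the real model

        # Ordering 1: optimizer created before wrapping.
        opt_before = torch.optim.AdamW(model.parameters(), lr=0.01)

        ddp_model = DDP(model)

        # Ordering 2: optimizer created from the wrapped model.
        opt_after = torch.optim.AdamW(ddp_model.parameters(), lr=0.01)

        params_before = [p for g in opt_before.param_groups for p in g["params"]]
        params_after = [p for g in opt_after.param_groups for p in g["params"]]

        # Compare by identity: both optimizers should reference the very same tensors.
        return len(params_before) == len(params_after) and all(
            a is b for a, b in zip(params_before, params_after)
        )
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    print(optimizer_sees_same_parameters())
```

The comparison uses `is` rather than `==` on purpose: the point is that both optimizers update the same tensor objects, not merely tensors with equal values.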
