[DDP] Define optimizer before DDP or after DDP (plus amp)

Hi, I have a question about the right order for defining the optimizer when using DDP together with amp.
Define the optimizer before DDP (torch.nn.parallel.DistributedDataParallel):

optimizer = torch.optim.AdamW(model.parameters(), lr=0.01)
model, optimizer = amp.initialize(model, optimizer,  opt_level='O2')
model = DDP(model)

Define the optimizer after DDP:

model = DDP(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.01)
model, optimizer = amp.initialize(model, optimizer,  opt_level='O2')

The official video classification script adopts the first logic.
However, I have seen a blog post saying that you should create the optimizer over the DDP-wrapped model when using DDP (though that article did not involve amp).
I wonder which one is correct, or whether it doesn't matter and both are fine. Thanks.

DDP does not change model.parameters(), and the optimizer works entirely locally, so defining the optimizer before or after wrapping model with DDP should not make a difference.
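
One way to check this locally is the minimal sketch below. It assumes a single CPU process with the gloo backend and a temporary file for rendezvous, which are illustration choices only; the helper name same_parameters_after_ddp is hypothetical, and amp is left out on purpose. It builds the optimizer before wrapping, then compares the tensors the optimizer holds with what the DDP-wrapped model exposes.

```python
import os
import tempfile

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def same_parameters_after_ddp():
    """Return True if an optimizer built before DDP holds the same tensors DDP exposes."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Single-process gloo group on CPU, created only so DDP can be constructed.
        # (Assumption for illustration; a real job would use its own launcher settings.)
        init_file = os.path.join(tmp_dir, "ddp_init")
        dist.init_process_group(
            backend="gloo",
            init_method=f"file://{init_file}",
            rank=0,
            world_size=1,
        )
        try:
            model = torch.nn.Linear(4, 2)

            # Optimizer defined BEFORE wrapping the model with DDP.
            optimizer = torch.optim.AdamW(model.parameters(), lr=0.01)
            optimizer_params = optimizer.param_groups[0]["params"]

            ddp_model = DDP(model)
            ddp_params = list(ddp_model.parameters())

            # Identity check: the very same Parameter objects, not copies.
            return len(optimizer_params) == len(ddp_params) and all(
                a is b for a, b in zip(optimizer_params, ddp_params)
            )
        finally:
            dist.destroy_process_group()


if __name__ == "__main__":
    print(same_parameters_after_ddp())
```

If the check returns True, the optimizer created before wrapping updates exactly the tensors that DDP synchronizes gradients for, so either ordering gives the optimizer the same parameters.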
