Hi, I have a question about the correct order for defining the optimizer when using DDP together with amp.
Define the optimizer before DDP (torch.nn.parallel.DistributedDataParallel):
optimizer = torch.optim.AdamW(model.parameters(), lr=0.01)
model, optimizer = amp.initialize(model, optimizer, opt_level='O2')
model = DDP(model)
Define the optimizer after DDP:
model = DDP(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.01)
model, optimizer = amp.initialize(model, optimizer, opt_level='O2')
The official video classification script uses the first ordering.
However, I have seen a blog post saying that when you use DDP you should create the optimizer over the DDP-wrapped model (although that article did not involve amp).
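For the plain DDP part of the question (no amp involved), a quick check like the sketch below might help narrow things down. It assumes a single-process CPU run with the gloo backend, and the helper name `same_parameter_objects` is made up for illustration. It only tests whether `DDP(model).parameters()` yields the very same tensor objects as `model.parameters()`; it says nothing about what `amp.initialize` does to the parameters under `opt_level='O2'`.

```python
import socket

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def _free_port():
    # Ask the OS for an unused TCP port for the single-process rendezvous.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def same_parameter_objects():
    """Hypothetical helper: compare parameter identity before and after DDP."""
    dist.init_process_group(
        backend="gloo",
        init_method=f"tcp://127.0.0.1:{_free_port()}",
        rank=0,
        world_size=1,
    )
    try:
        model = torch.nn.Linear(4, 2)
        before = list(model.parameters())
        ddp_model = DDP(model)  # CPU module, so no device_ids are passed
        after = list(ddp_model.parameters())
        # Identity check (is), not value equality.
        return len(before) == len(after) and all(
            a is b for a, b in zip(before, after)
        )
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    print(same_parameter_objects())
```

If this returns True, an optimizer built from either `model.parameters()` or `DDP(model).parameters()` would be tracking the same tensors, so the remaining question would be purely about where `amp.initialize` has to sit relative to the DDP wrapper.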
I wonder which one is correct, or whether it doesn't matter and both are fine. Thanks.