dataloader = ....(create ddp loader with ddp settings)
opt = ...parse()  # user options
master = opt.local_rank == 0
model = create_model(opt)
model_ema = model.clone().eval()  # tracks the exponential moving average (EMA) of the model's weights
for data in dataloader:
    # typical training code ... forward, backward and the like
    update_ema_weights(model_ema, model.state_dict())  # update the EMA weights
    if opt.validate:
        if master:
            for data in valid_dataloader:
                output = model_ema(data)....  # typical validation code
    torch.cuda.synchronize()
Given the pseudo-code above, my DDP processes hang on all GPUs once validation finishes.
However, if I validate with model instead of model_ema, they do not hang. Does anyone know how to fix this?
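
For reference, here is a minimal sketch of what a helper like `update_ema_weights` might do. It is not the actual implementation from the pseudo-code: the `decay` parameter, the handling of non-float buffers, and the toy `nn.Linear` model are assumptions for illustration. It also assumes both state dicts use the same keys (for example, no `module.` prefix added by a DDP wrapper on only one side), and it uses `copy.deepcopy` as a stand-in for the copy step.

```python
import copy

import torch
from torch import nn


@torch.no_grad()
def update_ema_weights(model_ema: nn.Module, state_dict: dict, decay: float = 0.999) -> None:
    """Blend `state_dict` into `model_ema` in place (hypothetical implementation)."""
    for name, ema_value in model_ema.state_dict().items():
        new_value = state_dict[name].detach()
        if ema_value.dtype.is_floating_point:
            # ema = decay * ema + (1 - decay) * new
            ema_value.mul_(decay).add_(new_value, alpha=1.0 - decay)
        else:
            # Integer buffers (e.g. BatchNorm's num_batches_tracked) are copied as-is.
            ema_value.copy_(new_value)


# Toy model standing in for create_model(opt); runs on CPU without DDP.
model = nn.Linear(2, 1)
model_ema = copy.deepcopy(model).eval()

# Set known values so the effect of one update is easy to inspect.
with torch.no_grad():
    model.weight.fill_(1.0)
    model_ema.weight.fill_(0.0)

update_ema_weights(model_ema, model.state_dict(), decay=0.9)
```

With these toy values, one update moves each EMA weight a tenth of the way from 0 toward 1, while the bias stays where it was because both copies started with the same bias.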