I would like to ask some questions regarding the DDP code used in torchvision's reference example on classification. An example invocation of this script, on a machine with 8 GPUs, is:
python -m torch.distributed.launch --nproc_per_node=8 --use_env train.py --model resnext50_32x4d --epochs 100
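(As an aside: on recent PyTorch releases, torch.distributed.launch is deprecated in favor of torchrun, so I believe the equivalent invocation would be:

torchrun --nproc_per_node=8 train.py --model resnext50_32x4d --epochs 100

torchrun always sets the environment variables, so --use_env is no longer needed.)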
My first question concerns the saving and loading of checkpoints.
This is how a checkpoint is saved in the script:
checkpoint = {
    'model': model_without_ddp.state_dict(),
    'optimizer': optimizer.state_dict(),
    'lr_scheduler': lr_scheduler.state_dict(),
    'epoch': epoch,
    'args': args}
utils.save_on_master(
    checkpoint,
    os.path.join(args.output_dir, 'model_{}.pth'.format(epoch)))
utils.save_on_master(
    checkpoint,
    os.path.join(args.output_dir, 'checkpoint.pth'))
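For context, save_on_master in the reference utils appears to guard the save so that only the main process writes to disk. A minimal sketch of that helper (my reconstruction, assuming rank 0 is treated as the main process):

import torch
import torch.distributed as dist

def is_main_process():
    # In single-process mode there is no process group, so the
    # only process counts as the main one.
    if not (dist.is_available() and dist.is_initialized()):
        return True
    return dist.get_rank() == 0

def save_on_master(*args, **kwargs):
    # Only rank 0 touches the filesystem; all other ranks skip the save.
    if is_main_process():
        torch.save(*args, **kwargs)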
But in the DDP tutorial, it seems that torch.distributed.barrier() must be called around saving and loading:
# Use a barrier() to make sure that process 1 loads the model after process 0 saves it.
dist.barrier()
...
# Use a barrier() to make sure that all processes have finished reading the checkpoint
dist.barrier()
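For reference, the full pattern from the tutorial looks roughly like this (a sketch; CHECKPOINT_PATH is a placeholder and ddp_model is the DDP-wrapped model):

import torch
import torch.distributed as dist

CHECKPOINT_PATH = 'checkpoint.pth'  # placeholder

if dist.get_rank() == 0:
    # Only rank 0 saves; DDP keeps the replicas synchronized, so the
    # state_dict is identical on every rank anyway.
    torch.save(ddp_model.state_dict(), CHECKPOINT_PATH)

# Barrier 1: the other ranks must not load before rank 0 has finished saving.
dist.barrier()

# Remap tensors saved from rank 0's device onto this rank's device.
map_location = {'cuda:0': 'cuda:%d' % dist.get_rank()}
ddp_model.load_state_dict(torch.load(CHECKPOINT_PATH, map_location=map_location))

# Barrier 2: rank 0 must not remove or overwrite the file before
# every rank has finished reading it.
dist.barrier()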
Why is dist.barrier() not necessary in the above reference example?
My second question is about the validation stage.
This is how it’s done in the script:
for epoch in range(args.start_epoch, args.epochs):
    if args.distributed:
        train_sampler.set_epoch(epoch)
    train_one_epoch(model, criterion, optimizer, data_loader, device, epoch, args.print_freq, args.apex)
    lr_scheduler.step()
    evaluate(model, criterion, data_loader_test, device=device)
Doesn’t this mean that the evaluate() function is called on all the processes (i.e. on all the GPUs in this case)? Shouldn’t we rather do something like this:
for epoch in range(args.start_epoch, args.epochs):
    if args.distributed:
        train_sampler.set_epoch(epoch)
    train_one_epoch(model, criterion, optimizer, data_loader, device, epoch, args.print_freq, args.apex)
    lr_scheduler.step()
    if torch.distributed.get_rank() == 0:  # master
        evaluate(model, criterion, data_loader_test, device=device)
        # save checkpoint here as well
But then, again, shouldn’t we wait, using dist.barrier(), for all the processes to finish their computations and for the gradient synchronization to complete, before evaluating the model?
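In other words, something along these lines (just a sketch of what I have in mind, not code from the reference script):

import torch.distributed as dist

for epoch in range(args.start_epoch, args.epochs):
    if args.distributed:
        train_sampler.set_epoch(epoch)
    train_one_epoch(model, criterion, optimizer, data_loader, device,
                    epoch, args.print_freq, args.apex)
    lr_scheduler.step()
    if args.distributed:
        # Wait for every rank to finish the epoch before the master
        # evaluates and saves, so that no rank races ahead.
        dist.barrier()
    if not args.distributed or dist.get_rank() == 0:  # master
        evaluate(model, criterion, data_loader_test, device=device)
        # save checkpoint here as well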
Thank you very much in advance for your help!