My question is: what are the differences between Apex distributed data-parallel and torch DDP? Since mixed-precision training is now also built into PyTorch, is there any specific reason to use apex DDP instead of torch DDP?
The apex implementations are deprecated, since PyTorch now supports them via native implementations, so you should not use apex/DDP or apex/AMP anymore.
This post explains it in more detail.
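For reference, the native replacements look roughly like this (a minimal sketch assuming a `torchrun` launch with an NCCL backend; the `Linear` model and `loader` are placeholders):

```python
import os

import torch
import torch.distributed as dist
from torch.cuda.amp import GradScaler, autocast
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes a torchrun launch, which sets LOCAL_RANK and the rendezvous env vars.
local_rank = int(os.environ["LOCAL_RANK"])
dist.init_process_group(backend="nccl")
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 10).cuda(local_rank)  # placeholder model
model = DDP(model, device_ids=[local_rank])       # native DDP wrapper
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = GradScaler()                             # native AMP loss scaler

for inputs, targets in loader:                    # `loader` is assumed to exist
    inputs = inputs.cuda(local_rank, non_blocking=True)
    targets = targets.cuda(local_rank, non_blocking=True)
    optimizer.zero_grad()
    with autocast():                              # mixed-precision forward pass
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
    scaler.scale(loss).backward()                 # scaled backward to avoid underflow
    scaler.step(optimizer)                        # unscales grads, then steps
    scaler.update()
```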
Hi, I am using PyTorch '1.10.0+cu113'.
The model trains properly with apex DDP, but with torch DDP it deadlocks after I save the model.
I have made sure the model is saved only on local_rank=0.
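The save pattern is roughly the following (a minimal sketch; the path and the `model` are placeholders, and `torch.distributed` is assumed to be initialized):

```python
import torch
import torch.distributed as dist

def save_checkpoint(ddp_model, path, local_rank):
    # Only rank 0 writes the file; save module.state_dict() so the
    # keys are not prefixed with "module." by the DDP wrapper.
    if local_rank == 0:
        torch.save(ddp_model.module.state_dict(), path)
    # Keep all ranks in sync so no process races ahead of the save.
    dist.barrier()
```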
Does this mean that you are able to successfully save the model, but the next iteration hangs?
Could you add print statements to the code to check where exactly the script is hanging?
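A rank-tagged helper makes it easier to tell the processes apart in the logs (a sketch, assuming `torch.distributed` is already initialized):

```python
import torch.distributed as dist

def trace(msg):
    # flush=True so the line appears even if the process hangs afterwards
    print(f"[rank {dist.get_rank()}] {msg}", flush=True)

trace("before save")
# ... save / next training step ...
trace("after save")
```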
Thanks for your reply. I have solved this problem. It was caused by running a partial dataloader only on local_rank=0 for a temporary evaluation. It seems that the dataloaders in all processes must stay in the same state.
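For anyone hitting the same issue: a likely cause is that a forward pass through the DDP wrapper itself issues collective calls (for example the buffer broadcast when `broadcast_buffers=True`, the default), so running it on a single rank blocks the others. One workaround is to evaluate with the unwrapped `model.module` so no collectives are issued (a sketch; `eval_loader` and the loss function are placeholders):

```python
import torch

@torch.no_grad()
def evaluate_on_rank0(ddp_model, eval_loader, local_rank):
    # Use the unwrapped module so no DDP collectives (e.g. buffer
    # broadcasts) are issued from a single rank.
    if local_rank != 0:
        return None
    module = ddp_model.module
    module.eval()
    losses = []
    for inputs, targets in eval_loader:
        outputs = module(inputs.cuda(local_rank))
        losses.append(
            torch.nn.functional.mse_loss(outputs, targets.cuda(local_rank)).item()
        )
    module.train()
    return sum(losses) / len(losses)
```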