My question is: what are the differences between Apex distributed data-parallel and torch DDP? Since mixed-precision training is now also built into PyTorch, is there any specific reason to use apex DDP instead of torch DDP?
The apex implementations are deprecated, since PyTorch now supports them via native implementations, so you should not use apex/DDP or apex/AMP anymore.
This post explains it in more detail.
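For reference, the native replacements look roughly like this (a minimal sketch assuming a `torchrun` launch with an NCCL backend; the `Linear` model and `loader` are placeholders):

```python
import os

import torch
import torch.distributed as dist
from torch.cuda.amp import GradScaler, autocast
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes a torchrun launch, which sets LOCAL_RANK and the rendezvous env vars.
local_rank = int(os.environ["LOCAL_RANK"])
dist.init_process_group(backend="nccl")
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 10).cuda(local_rank)  # placeholder model
model = DDP(model, device_ids=[local_rank])       # native DDP wrapper
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = GradScaler()                             # native AMP loss scaler

for inputs, targets in loader:                    # `loader` is assumed to exist
    inputs = inputs.cuda(local_rank, non_blocking=True)
    targets = targets.cuda(local_rank, non_blocking=True)
    optimizer.zero_grad()
    with autocast():                              # mixed-precision forward pass
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
    scaler.scale(loss).backward()                 # scaled backward to avoid underflow
    scaler.step(optimizer)                        # unscales grads, then steps
    scaler.update()
```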
Hi, I am using PyTorch '1.10.0+cu113'.
The model trains properly with apex DDP, but with torch DDP it deadlocks after I save the model.
I have made sure the model is saved only on local_rank=0.
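The save pattern is roughly the following (a minimal sketch; the path and the `model` are placeholders, and `torch.distributed` is assumed to be initialized):

```python
import torch
import torch.distributed as dist

def save_checkpoint(ddp_model, path, local_rank):
    # Only rank 0 writes the file; save module.state_dict() so the
    # keys are not prefixed with "module." by the DDP wrapper.
    if local_rank == 0:
        torch.save(ddp_model.module.state_dict(), path)
    # Keep all ranks in sync so no process races ahead of the save.
    dist.barrier()
```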
Does this mean that you are able to successfully save the model, but the next iteration hangs?
Could you add print statements to the code to check where exactly the script is hanging?
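A rank-tagged helper makes it easier to tell the processes apart in the logs (a sketch, assuming `torch.distributed` is already initialized):

```python
import torch.distributed as dist

def trace(msg):
    # flush=True so the line appears even if the process hangs afterwards
    print(f"[rank {dist.get_rank()}] {msg}", flush=True)

trace("before save")
# ... save / next training step ...
trace("after save")
```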
Thanks for your reply. I have solved this problem. It was caused by running a partial dataloader only on local_rank=0 for a temporary evaluation. It seems that the dataloaders in all processes must stay in the same state.
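For anyone hitting the same issue: a likely cause is that a forward pass through the DDP wrapper itself issues collective calls (for example the buffer broadcast when `broadcast_buffers=True`, the default), so running it on a single rank blocks the others. One workaround is to evaluate with the unwrapped `model.module` so no collectives are issued (a sketch; `eval_loader` and the loss function are placeholders):

```python
import torch

@torch.no_grad()
def evaluate_on_rank0(ddp_model, eval_loader, local_rank):
    # Use the unwrapped module so no DDP collectives (e.g. buffer
    # broadcasts) are issued from a single rank.
    if local_rank != 0:
        return None
    module = ddp_model.module
    module.eval()
    losses = []
    for inputs, targets in eval_loader:
        outputs = module(inputs.cuda(local_rank))
        losses.append(
            torch.nn.functional.mse_loss(outputs, targets.cuda(local_rank)).item()
        )
    module.train()
    return sum(losses) / len(losses)
```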