I don’t understand what the issue is. Why did your code hang? That is essential information to include here. Have you tried any of the following:
- Getting Started with Distributed Data Parallel — PyTorch Tutorials 1.10.1+cu102 documentation
- Checkpointing DDP.module instead of DDP itself - #2 by mrshenli
- Checkpointing DDP.module instead of DDP itself - #3 by Brando_Miranda
If none of these worked, can you provide more details? Your original post does not describe enough to know what the problem is. Things can hang for many reasons, especially in complicated multiprocessing code.
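For reference, the pattern the second and third links above discuss is checkpointing the wrapped `module` rather than the DDP wrapper itself, so the saved state dict has no `module.` key prefix and loads cleanly into an unwrapped model. Here is a minimal sketch of that pattern; it uses a stand-in wrapper class instead of real `DistributedDataParallel` so it runs in a single process (in real code you would wrap with DDP after `init_process_group` and save only on rank 0):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)

# Stand-in for nn.parallel.DistributedDataParallel, which exposes the
# wrapped model as .module. This lets the example run without a process
# group; the save/load pattern is identical with the real wrapper.
class FakeDDP(nn.Module):
    def __init__(self, module):
        super().__init__()
        self.module = module

    def forward(self, x):
        return self.module(x)

ddp_model = FakeDDP(model)

# Save the unwrapped module's weights, not ddp_model.state_dict(),
# so the checkpoint keys carry no "module." prefix.
torch.save(ddp_model.module.state_dict(), "checkpoint.pt")

# Load back into a fresh, unwrapped model with no key-renaming needed.
fresh = nn.Linear(4, 2)
fresh.load_state_dict(torch.load("checkpoint.pt"))
```

If you checkpoint the DDP wrapper directly instead, every key gains a `module.` prefix and loading into a plain model fails with missing/unexpected key errors.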