I figured out why my training hangs: it is caused by early stopping.
Sometimes two out of four GPUs hit the early-stop condition and exit while the other two are still training. At the gradient synchronization step, training hangs because the gradients from the two ranks that left the party never arrive, and the collective call waits forever.
I also found a solution for adding early stopping to DDP training via the post below.
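The usual fix is to make the stop decision collectively: every rank contributes its local early-stop flag through an `all_reduce`, so all ranks agree on the same decision and no rank exits the collective calls alone. Here is a minimal sketch of that idea; `should_stop_all_ranks` is a hypothetical helper name of my own, not from the post, and the demo runs single-process on the `gloo` backend just to show the call shape.

```python
# Sketch: synchronize the early-stop decision across all DDP ranks so no
# rank exits the training loop while others are still reducing gradients.
# `should_stop_all_ranks` is a hypothetical name for illustration.
import os
import torch
import torch.distributed as dist

def should_stop_all_ranks(local_stop: bool) -> bool:
    # Every rank contributes its local decision. ReduceOp.MAX means all
    # ranks stop as soon as ANY rank wants to stop; use ReduceOp.MIN
    # instead if every rank must agree before stopping.
    flag = torch.tensor(int(local_stop))
    dist.all_reduce(flag, op=dist.ReduceOp.MAX)
    return bool(flag.item())

if __name__ == "__main__":
    # Single-process demo so the sketch runs standalone; in real DDP the
    # launcher (torchrun) sets rank/world_size and you call this helper
    # once per epoch before deciding to break out of the training loop.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)
    print(should_stop_all_ranks(False))
    print(should_stop_all_ranks(True))
    dist.destroy_process_group()
```

In the training loop, each rank would compute its own `local_stop` (e.g. from validation loss on rank 0, broadcast or reduced as above) and then `break` only when `should_stop_all_ranks(local_stop)` returns `True`, so all four GPUs leave the loop on the same iteration.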