Distributed training got stuck every few seconds

Thanks for your suggestions. I will give them a try.
I think this issue is related to my environment, where I have observed some strange behavior. For example, another CUDA error occurs on the same machine. Your comments about this CUDA error are definitely welcome!