Loss.backward() occasionally times out in distributed training

I'm using PyTorch 0.4.1, and in distributed training I occasionally hit a timeout inside loss.backward(). The call usually takes 1~2 seconds, but sometimes 10~20 or even 30+ seconds, which causes the Gloo backend to time out and the training run to fail. Does anyone know why?
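
For reference, here is a minimal sketch of the kind of setup being described (the model, sizes, and threshold are illustrative, not from the original post): DistributedDataParallel over the Gloo backend, timing each loss.backward() call to surface the occasional spikes. Note that the `timeout=` keyword of `init_process_group` is only available in releases after 0.4.1.

```python
# Illustrative sketch: DDP over Gloo, timing loss.backward() to spot slow steps.
import time
import datetime

import torch
import torch.distributed as dist
import torch.nn as nn


def train(rank, world_size):
    # On PyTorch 0.4.1 init_process_group has no timeout argument;
    # the timeout= keyword shown here exists only in later releases.
    dist.init_process_group(
        backend="gloo",
        init_method="env://",
        rank=rank,
        world_size=world_size,
        timeout=datetime.timedelta(minutes=30),  # raise the Gloo timeout
    )

    model = nn.Linear(1024, 10)  # placeholder model
    model = nn.parallel.DistributedDataParallel(model)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(100):
        inputs = torch.randn(32, 1024)
        targets = torch.randint(0, 10, (32,))

        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)

        start = time.time()
        loss.backward()          # gradients are all-reduced here
        elapsed = time.time() - start
        if elapsed > 5.0:        # log the occasional 10~30s spikes
            print(f"rank {rank} step {step}: backward took {elapsed:.1f}s")

        optimizer.step()
```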

I ran into the same issue on 0.4.1. What confuses me is that it times out as soon as the second epoch of training begins, while the first epoch runs very smoothly. I don't know what the reason is…