Using PyTorch v0.4, the training speed suddenly declines dramatically

Hi colleagues,
I am using PyTorch v0.4 to train a detection network with an LSTM, but the training speed is very unstable. For example, one iteration takes 4.1845 s at iter 15010, but a forward pass takes 38.7737 s at iter 15250. Are there any latent bugs in my code? Could you suggest some probable solutions?

iter 15010 || Loss: 3.2980, lr: 0.00001|| Timer: 4.1845 sec.
iter 15020 || Loss: 3.4412, lr: 0.00001|| Timer: 5.4093 sec.
iter 15030 || Loss: 2.2105, lr: 0.00001|| Timer: 5.8068 sec.
iter 15040 || Loss: 2.1901, lr: 0.00001|| Timer: 6.4231 sec.
iter 15050 || Loss: 2.9676, lr: 0.00001|| Timer: 6.0617 sec.
iter 15060 || Loss: 2.0652, lr: 0.00001|| Timer: 6.3771 sec.
iter 15070 || Loss: 2.3229, lr: 0.00001|| Timer: 5.5849 sec.
iter 15080 || Loss: 2.0898, lr: 0.00001|| Timer: 6.5760 sec.
iter 15090 || Loss: 2.1856, lr: 0.00001|| Timer: 6.4950 sec.
iter 15100 || Loss: 1.7462, lr: 0.00001|| Timer: 6.2931 sec.
iter 15110 || Loss: 2.6061, lr: 0.00001|| Timer: 6.8521 sec.
iter 15120 || Loss: 2.1789, lr: 0.00001|| Timer: 6.3654 sec.
iter 15130 || Loss: 3.2575, lr: 0.00001|| Timer: 6.7158 sec.
iter 15140 || Loss: 2.4897, lr: 0.00001|| Timer: 6.1295 sec.
iter 15150 || Loss: 2.1574, lr: 0.00001|| Timer: 5.6241 sec.
iter 15160 || Loss: 2.1038, lr: 0.00001|| Timer: 4.8552 sec.
iter 15170 || Loss: 3.4290, lr: 0.00001|| Timer: 6.3182 sec.
iter 15180 || Loss: 2.2228, lr: 0.00001|| Timer: 6.4761 sec.
iter 15190 || Loss: 2.3028, lr: 0.00001|| Timer: 5.8832 sec.
iter 15200 || Loss: 1.9252, lr: 0.00001|| Timer: 5.6467 sec.
iter 15210 || Loss: 2.1436, lr: 0.00001|| Timer: 6.1791 sec.
iter 15220 || Loss: 2.3701, lr: 0.00001|| Timer: 6.1901 sec.
iter 15230 || Loss: 2.7390, lr: 0.00001|| Timer: 6.4191 sec.
iter 15240 || Loss: 2.0627, lr: 0.00001|| Timer: 6.1050 sec.
iter 15250 || Loss: 2.4726, lr: 0.00001|| Timer: 38.7737 sec.
iter 15260 || Loss: 2.4106, lr: 0.00001|| Timer: 40.9049 sec.
iter 15270 || Loss: 1.9975, lr: 0.00001|| Timer: 47.4742 sec.
iter 15280 || Loss: 1.7045, lr: 0.00001|| Timer: 40.5929 sec.
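(For reference, per-iteration timings like the ones in the log above are usually collected along the lines of the sketch below. This is a minimal sketch with placeholder names — net, images, targets, criterion, optimizer are not taken from the linked repository — and it calls torch.cuda.synchronize() so that the measured time actually includes pending GPU work; without it, queued kernels can make a later iteration appear to absorb the cost.)

import time
import torch

# Minimal timing sketch (net, images, targets, criterion, optimizer are placeholders).
torch.cuda.synchronize()              # finish any pending GPU work before starting the clock
t0 = time.time()

out = net(images)                     # forward pass
loss = criterion(out, targets)
optimizer.zero_grad()
loss.backward()                       # backward pass
optimizer.step()

torch.cuda.synchronize()              # wait for the GPU before reading the clock
print('Timer: %.4f sec.' % (time.time() - t0))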

Is the time constant after iter 15250?
If not, do you see the time spikes after a certain number of iterations?

Could you provide an executable code snippet reproducing this issue, so that I could have a look at it on my machine?

It will return to ~5 s per forward pass, but it is difficult to find out why this happens. In addition, when the time cost increases, the performance (i.e., mAP) of the model becomes dramatically worse.

My code is available at https://github.com/SeanChenxy/TSSD-OTA/tree/0.40, but it may take a little time to run. I have documented how to run it and reproduce this issue, and I think the most likely problematic code is in class TSSD and class seqMultiBoxLoss. If it is convenient for you, I hope you can have a look at it. Thank you very much.

@SeanChenxy please run your code with our bottleneck utility to identify the slowdowns: https://pytorch.org/docs/stable/bottleneck.html
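For reference, the bottleneck utility is run as a module against the training script; the script name and arguments below are placeholders:

python -m torch.utils.bottleneck train.py --your-args

It runs the script under both the Python profiler and the autograd profiler and prints a summary of the most expensive operations.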


I will try it. Thank you very much.

Hi smth, I’m having almost exactly the same issue. I’m using PyTorch 0.4.0, torchvision 0.2.1 and CUDA 9.0. Running bottleneck didn’t give me a lot of information, except that convolutions took the most time. I managed to run the profiler instead, and found that almost every operation took significantly longer in those slow iterations, and the forward pass slowed down more than the backward pass.
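For reference, the kind of per-op breakdown mentioned above can be collected with the autograd profiler roughly as follows (a minimal sketch assuming the 0.4-era profiler API; model, criterion, images and targets are placeholders):

import torch
from torch.autograd import profiler

# Profile a single training step (model, criterion, images, targets are placeholders).
with profiler.profile(use_cuda=True) as prof:
    out = model(images)
    loss = criterion(out, targets)
    loss.backward()

# Sort by total CUDA time to see which ops dominate the slow iterations.
print(prof.key_averages().table(sort_by='cuda_time_total'))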

Hi Sean, did you solve the issue?

Can you check via nvidia-smi whether your GPU is being downclocked in these “slow” iterations (maybe for power-draw / overheating reasons)?
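One way to watch for this while training is to poll the clocks, temperature and power draw in a second terminal, for example:

nvidia-smi --query-gpu=index,pstate,clocks.sm,temperature.gpu,power.draw --format=csv -l 5

A drop in clocks.sm together with a high temperature or power draw during the slow iterations would point to thermal or power throttling.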

Hi, I’m sorry for the late reply, but I have not solved it yet. I hope you can address it and give me some suggestions.

I checked nvidia-smi, and found that it was indeed downclocked. But it’s weird, since I was running on a cluster, and everyone else’s tasks, as well as tasks on my other code bases, were perfectly normal. I reconfigured my conda environment, and somehow now it’s occurring much more rarely (but still occurring). I’m still not sure what happened.

As in my reply to smth, I reconfigured a new conda environment, and now the problem is occurring much more rarely, though it still occurs sometimes. It’s an ugly fix and I don’t know why it worked, but that’s about all I can think of doing right now.

I reconfigured a new conda environment, and now the problem is occurring much more rarely, though it still occurs sometimes. It’s an ugly fix and I don’t know why it worked

It’s possible that the new conda environment is using slightly different CUDA / cuDNN libraries, and those don’t push the GPU into overclocking territory as much. Just a possibility…
