HI everyone. I’m experiencing an issue that I can’t explain: I have two models which produce an output tensor with the same size (8, 8192), however calling the F.mse_loss on the output of one takes less time than calling it on the other.

See this gist to run the comparison with synthetic data:

V0 Total: 1.3554 Loss forward: 0.0042 Loss backward: 0.9177
V0 Total: 0.0344 Loss forward: 0.0132 Loss backward: 0.0091
V0 Total: 0.1393 Loss forward: 0.1089 Loss backward: 0.0085
V0 Total: 0.1405 Loss forward: 0.1095 Loss backward: 0.0085
V0 Total: 0.1380 Loss forward: 0.1079 Loss backward: 0.0083
V0 Total: 0.1379 Loss forward: 0.1077 Loss backward: 0.0083
V0 Total: 0.1380 Loss forward: 0.1077 Loss backward: 0.0084
V0 Total: 0.1388 Loss forward: 0.1085 Loss backward: 0.0084
V1 Total: 0.7927 Loss forward: 0.0001 Loss backward: 0.5020
V1 Total: 0.0289 Loss forward: 0.0066 Loss backward: 0.0195
V1 Total: 0.0267 Loss forward: 0.0211 Loss backward: 0.0026
V1 Total: 0.0337 Loss forward: 0.0283 Loss backward: 0.0026
V1 Total: 0.0341 Loss forward: 0.0287 Loss backward: 0.0030
V1 Total: 0.0336 Loss forward: 0.0285 Loss backward: 0.0030
V1 Total: 0.0335 Loss forward: 0.0287 Loss backward: 0.0030
V1 Total: 0.0335 Loss forward: 0.0289 Loss backward: 0.0030

You should keep in mind that the cuda api is asynchronous, so to get proper measurement, you should always call torch.cuda.synchronize() just before calling time.time().
The difference in timing here is most certainly that one model use the gpu more and so the queuing of the operations corresponding to the loss is slightly longer.

thanks for your answer! And thanks for the fix on torch.cuda.synchronize() by adding it, the situation becomes:

V0 Total: 2.5034 Loss forward: 0.0111 Loss backward: 1.7371
V0 Total: 0.2608 Loss forward: 0.0004 Loss backward: 0.2297
V0 Total: 0.2657 Loss forward: 0.0061 Loss backward: 0.2203
V0 Total: 0.2759 Loss forward: 0.0068 Loss backward: 0.2181
V0 Total: 0.2736 Loss forward: 0.0067 Loss backward: 0.2147
V0 Total: 0.2715 Loss forward: 0.0047 Loss backward: 0.2148
V0 Total: 0.2697 Loss forward: 0.0040 Loss backward: 0.2150
V0 Total: 0.2614 Loss forward: 0.0001 Loss backward: 0.2159
V1 Total: 1.3935 Loss forward: 0.0006 Loss backward: 0.8544
V1 Total: 0.0790 Loss forward: 0.0041 Loss backward: 0.0590
V1 Total: 0.0854 Loss forward: 0.0068 Loss backward: 0.0594
V1 Total: 0.0631 Loss forward: 0.0059 Loss backward: 0.0390
V1 Total: 0.0669 Loss forward: 0.0048 Loss backward: 0.0536
V1 Total: 0.0844 Loss forward: 0.0068 Loss backward: 0.0584
V1 Total: 0.0596 Loss forward: 0.0040 Loss backward: 0.0368
V1 Total: 0.0669 Loss forward: 0.0026 Loss backward: 0.0560

The difference in the forward has disappeared. So do I understand correctly that by calling the synchronize we cancel any influence of the queuing?

The difference in the backward is still a bit puzzling, since the V0 model has 2.3x more operations (back of the envelope calculation), but the ratio here is more like 1/4.

I am not sure how to explain that. There are a lot of factors that can impact the runtime. Especially for such small batch size where the time to launch jobs on the gpu is not negligeable.

Well, in the end what I care about is the timing that takes queuing into account, that’s what I need to know which model is faster to train on my system. Thank you, it was very instructive