I tried profiling with Nsight Systems; the only difference is in the gpukernsum report. I see that PyTorch 1.7 spends more time in “volta_sgemm_128x64_nn” than PyTorch 1.2, but I am not sure I understand the root cause.
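For reference, the summaries below were generated roughly like this (a sketch; the report name pt17 and script name infer.py are placeholders for my actual files):

```
# profile the inference script under Nsight Systems
nsys profile -o pt17 python infer.py
# per-kernel summary (the tables below)
nsys stats --report gpukernsum pt17.qdrep
```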
PyTorch 1.7:
Time(%) Total Time (ns) Instances Average (ns) Minimum (ns) Maximum (ns) Name
37.7 196761876 240 819841.2 204254 1039382 volta_sgemm_128x64_nn
11.3 58673890 2720 21571.3 20671 26816 volta_sgemm_64x32_sliced1x4_nn
8.4 43656087 200 218280.4 203550 396508 volta_sgemm_128x64_tn
7.8 40924422 1280 31972.2 31520 40352 volta_sgemm_128x32_sliced1x4_nn
4.1 21359458 250 85437.8 3520 141279 void cudnn::winograd_nonfused::winogradForwardOutput4x4<float, float>(cudnn::winograd_nonfused::Win…
4.1 21252334 250 85009.3 35775 162527 void cudnn::winograd_nonfused::winogradForwardData4x4<float, float>(cudnn::winograd_nonfused::Winog…
PyTorch 1.2:
Time(%) Total Time (ns) Instances Average (ns) Minimum (ns) Maximum (ns) Name
35.2 167744736 240 698936.4 174911 886457 volta_sgemm_128x64_nn
11.7 55695755 2720 20476.4 15456 61119 volta_sgemm_64x32_sliced1x4_nn
8.1 38854146 1280 30354.8 27200 83679 volta_sgemm_128x32_sliced1x4_nn
6.9 32906676 190 173193.0 169790 186815 volta_sgemm_128x64_tn
4.5 21690195 250 86760.8 35840 164703 void cudnn::winograd_nonfused::winogradForwardData4x4<float, float>(cudnn::winograd_nonfused::Winog…
4.5 21461298 250 85845.2 3264 140415 void cudnn::winograd_nonfused::winogradForwardOutput4x4<float, float>(cudnn::winograd_nonfused::Win…
The difference sums to about 66 ms over 10 batches, or roughly 7 ms/batch, which is significant but still far from the 40 ms/batch measured in wall-clock time. One interesting fact is that running the script under Nsight takes about 1.2 s longer with PyTorch 1.7 (6.98 s) than with 1.2 (5.74 s). So I would venture that whatever is causing the discrepancy is also slowing down the profiled run.
Looking at the iterations themselves, there is an initial spike in both versions, but PyTorch 1.2 settles at a lower per-iteration latency than PyTorch 1.7.
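For context, the per-iteration ranges in the tables below (from `nsys stats --report nvtxsum`) come from NVTX annotations around each forward pass. A minimal sketch of the pattern, with a hypothetical model and input standing in for my actual script:

```python
import torch
import torchvision.models as models

# Hypothetical stand-ins for the model and input used in my actual script.
model = models.resnet50().cuda().eval()
batch = torch.randn(32, 3, 224, 224, device="cuda")

with torch.no_grad():
    for i in range(10):
        torch.cuda.nvtx.range_push(f"infer {i}")  # shows up as "infer N" in nvtxsum
        model(batch)
        torch.cuda.synchronize()  # make the range cover the GPU work it launched
        torch.cuda.nvtx.range_pop()
```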
PyTorch 1.7:
Time(%) Total Time (ns) Instances Average (ns) Minimum (ns) Maximum (ns) Range
51.6 3599625989 1 3599625989.0 3599625989 3599625989 infer 0
14.8 1030896235 1 1030896235.0 1030896235 1030896235 infer 1
4.3 301591566 1 301591566.0 301591566 301591566 infer 2
4.2 293810608 1 293810608.0 293810608 293810608 infer 5
4.2 293558485 1 293558485.0 293558485 293558485 infer 6
4.2 293256528 1 293256528.0 293256528 293256528 infer 8
4.2 292517955 1 292517955.0 292517955 292517955 infer 9
4.2 292325983 1 292325983.0 292325983 292325983 infer 3
4.2 291994829 1 291994829.0 291994829 291994829 infer 4
4.2 291794244 1 291794244.0 291794244 291794244 infer 7
PyTorch 1.2:
Time(%) Total Time (ns) Instances Average (ns) Minimum (ns) Maximum (ns) Range
62.7 3610094147 1 3610094147.0 3610094147 3610094147 infer 0
4.2 240599239 1 240599239.0 240599239 240599239 infer 6
4.2 240580063 1 240580063.0 240580063 240580063 infer 5
4.2 239375786 1 239375786.0 239375786 239375786 infer 2
4.1 238719644 1 238719644.0 238719644 238719644 infer 9
4.1 238714962 1 238714962.0 238714962 238714962 infer 3
4.1 238412565 1 238412565.0 238412565 238412565 infer 7
4.1 238015374 1 238015374.0 238015374 238015374 infer 1
4.1 237976813 1 237976813.0 237976813 237976813 infer 8
4.1 237764668 1 237764668.0 237764668 237764668 infer 4
Ignoring the first inferences, the difference here is about 51 ms/batch, which is very close to the ~47 ms/batch gap I observed initially.
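For comparison, the wall-clock number comes from timing along the lines of the following sketch (my actual measurement code may differ slightly):

```python
import time
import torch

def avg_batch_time(model, batch, iters=10):
    with torch.no_grad():
        model(batch)                  # warm-up pass to exclude the initial spike
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(batch)
        torch.cuda.synchronize()      # wait for all queued GPU work to finish
    return (time.perf_counter() - start) / iters
```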
Do these numbers help? I feel like they confirm the issue, but I am not sure they really help with troubleshooting.