50% inference latency increase on V100 between 1.5 and 1.6

Hi,

I am benchmarking models for an upcoming production update, and I noticed a considerable slowdown between 1.2 (the version we currently use) and 1.7 (the version we want to move to).

I ran tests with 1.2, 1.3, 1.4, 1.5, 1.6 and 1.7, always with CUDA 10.2 and cuDNN 7.6. PyTorch was installed with pip3 on Ubuntu 18.04 running the AWS Deep Learning AMI on a p3.2xlarge, which uses a V100-SXM2. The CUDA version installed on the system is 11 and the driver version is 450.80.02.

Now for the actual numbers: 1.2 through 1.5 clock in at about 80 ms/batch, 1.6 at 147 ms/batch and 1.7 at 120 ms/batch. All measurements call torch.cuda.synchronize() after each iteration, over 100 iterations, and the timings above were taken with time.time().
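
For reference, the measurement loop is essentially the following rough sketch (`model` and `batch` are placeholders here, not the real code, and the warm-up iteration is added for clarity):

```python
import time
import torch

# rough sketch of the benchmark loop (model/batch are placeholders,
# the real model cannot be shared)
with torch.no_grad():
    model(batch)                  # warm-up, not timed
    torch.cuda.synchronize()

    start = time.time()
    for _ in range(100):
        model(batch)
        torch.cuda.synchronize()  # wait for the GPU after each iteration
    print("ms/batch:", (time.time() - start) / 100 * 1000)
```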

Unfortunately I cannot distribute the actual code for the model, but it’s a pretty standard feature extractor with an attention layer on top.
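
As a very rough, hypothetical stand-in for the architecture (made-up names and sizes, using torchvision's resnet50 purely for illustration, not the production model), think of something along these lines:

```python
import torch
import torch.nn as nn
import torchvision

class FeatureExtractorWithAttention(nn.Module):
    """Hypothetical stand-in: CNN feature extractor + one attention layer on top."""
    def __init__(self, d_model=2048, n_heads=8):
        super().__init__()
        backbone = torchvision.models.resnet50(pretrained=False)
        # keep only the convolutional part of the backbone
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.attn = nn.MultiheadAttention(d_model, n_heads)

    def forward(self, x):
        f = self.features(x)               # (N, C, H, W)
        f = f.flatten(2).permute(2, 0, 1)  # (H*W, N, C) layout for MultiheadAttention
        out, _ = self.attn(f, f, f)        # self-attention over spatial positions
        return out
```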

Finally, when benchmarking with the PyTorch profiler, the numbers are very different (and make more sense):

  • PyTorch 1.2: CPU time: 97 ms, CUDA time: 315 ms
  • PyTorch 1.7: CPU time: 130 ms, CUDA time: 131 ms
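
These totals come from the built-in autograd profiler; a minimal sketch of how they were collected (again with placeholders, the real script uses its own inputs):

```python
import torch

with torch.no_grad():
    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        model(batch)

# per-op summary; the CPU/CUDA totals above are the sums reported here
print(prof.key_averages().table(sort_by="cuda_time_total"))
```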

Has anyone encountered a similar issue? A <10% increase would be fine, but this is +50%.

Thank you
Edouard

To get a better picture of which part of the code is slower, you could profile it via e.g. Nsight Systems as described here. This would give you the timeline, and you could compare your workloads to each other.
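
To make individual iterations easy to find in the timeline, you could also wrap them in NVTX ranges, e.g. roughly like this (model/batch stand for your actual workload):

```python
import torch

# each iteration shows up as a named range in the Nsight Systems timeline
with torch.no_grad():
    for i in range(10):
        torch.cuda.nvtx.range_push("infer {}".format(i))
        model(batch)
        torch.cuda.synchronize()
        torch.cuda.nvtx.range_pop()
```

The script would then be launched under the profiler, e.g. `nsys profile python script.py`, and the summaries (such as the gpukernsum report) can be generated afterwards with `nsys stats` on the resulting report file.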

I tried profiling with Nsight; the only difference is in the gpukernsum report. I see that PyTorch 1.7 spends more time in “volta_sgemm_128x64_nn” than PyTorch 1.2, but I am not sure I understand the root cause.

PyTorch 1.7:

Time(%) Total Time (ns) Instances Average Minimum Maximum Name


37.7        196761876        240  819841.2   204254  1039382  volta_sgemm_128x64_nn
11.3         58673890       2720   21571.3    20671    26816  volta_sgemm_64x32_sliced1x4_nn
 8.4         43656087        200  218280.4   203550   396508  volta_sgemm_128x64_tn
 7.8         40924422       1280   31972.2    31520    40352  volta_sgemm_128x32_sliced1x4_nn
 4.1         21359458        250   85437.8     3520   141279  void cudnn::winograd_nonfused::winogradForwardOutput4x4<float, float>(cudnn::winograd_nonfused::Win…
 4.1         21252334        250   85009.3    35775   162527  void cudnn::winograd_nonfused::winogradForwardData4x4<float, float>(cudnn::winograd_nonfused::Winog…

PyTorch 1.2:

Time(%) Total Time (ns) Instances Average Minimum Maximum Name


35.2        167744736        240  698936.4   174911   886457  volta_sgemm_128x64_nn
11.7         55695755       2720   20476.4    15456    61119  volta_sgemm_64x32_sliced1x4_nn
 8.1         38854146       1280   30354.8    27200    83679  volta_sgemm_128x32_sliced1x4_nn
 6.9         32906676        190  173193.0   169790   186815  volta_sgemm_128x64_tn
 4.5         21690195        250   86760.8    35840   164703  void cudnn::winograd_nonfused::winogradForwardData4x4<float, float>(cudnn::winograd_nonfused::Winog…
 4.5         21461298        250   85845.2     3264   140415  void cudnn::winograd_nonfused::winogradForwardOutput4x4<float, float>(cudnn::winograd_nonfused::Win…

The difference adds up to about 66 ms over 10 batches, or roughly 7 ms/batch, which is significant but still far from the 40 ms/batch measured in wall-clock time. One interesting fact is that running the script under Nsight takes almost 2 full seconds longer with PyTorch 1.7 (6.98 s) than with 1.2 (5.74 s). So I would venture that whatever is causing the discrepancy is also slowing down the Nsight run.

When looking at the iterations themselves, we can see that there is an initial spike, but PyTorch 1.2 settles at a lower latency than PyTorch 1.7.

PyTorch 1.7:

Time(%) Total Time (ns) Instances Average Minimum Maximum Range


51.6       3599625989          1  3599625989.0  3599625989  3599625989  infer 0
14.8       1030896235          1  1030896235.0  1030896235  1030896235  infer 1
 4.3        301591566          1   301591566.0   301591566   301591566  infer 2
 4.2        293810608          1   293810608.0   293810608   293810608  infer 5
 4.2        293558485          1   293558485.0   293558485   293558485  infer 6
 4.2        293256528          1   293256528.0   293256528   293256528  infer 8
 4.2        292517955          1   292517955.0   292517955   292517955  infer 9
 4.2        292325983          1   292325983.0   292325983   292325983  infer 3
 4.2        291994829          1   291994829.0   291994829   291994829  infer 4
 4.2        291794244          1   291794244.0   291794244   291794244  infer 7

PyTorch 1.2:

Time(%) Total Time (ns) Instances Average Minimum Maximum Range


62.7       3610094147          1  3610094147.0  3610094147  3610094147  infer 0
 4.2        240599239          1   240599239.0   240599239   240599239  infer 6
 4.2        240580063          1   240580063.0   240580063   240580063  infer 5
 4.2        239375786          1   239375786.0   239375786   239375786  infer 2
 4.1        238719644          1   238719644.0   238719644   238719644  infer 9
 4.1        238714962          1   238714962.0   238714962   238714962  infer 3
 4.1        238412565          1   238412565.0   238412565   238412565  infer 7
 4.1        238015374          1   238015374.0   238015374   238015374  infer 1
 4.1        237976813          1   237976813.0   237976813   237976813  infer 8
 4.1        237764668          1   237764668.0   237764668   237764668  infer 4

The difference here is about 51 ms/batch if we ignore the first inferences, which is very similar to the 47 ms/batch observed initially.

Do these help? I feel like they confirm the issue, but I'm not sure they really help with troubleshooting.

Doing my mea culpa here, hopefully it can help someone else at some point:

The wall-clock time was measured inside a profiler context. It's the profiler context that was slowing down the overall inference, not the PyTorch version itself (the PyTorch 1.2 profiler somehow has less overhead than the PyTorch 1.7 profiler). Needless to say, measuring wall-clock time with the profiler active is not desirable anyway.
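
In other words, the timing and the profiling should be kept separate; schematically (run_iterations is just a placeholder for the 100-iteration benchmark loop):

```python
import time
import torch

# what I was doing: wall clock measured *inside* the profiler context,
# so the profiler overhead ends up in the timing
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    start = time.time()
    run_iterations()              # placeholder for the benchmark loop
    torch.cuda.synchronize()
    print("ms (with profiler):", (time.time() - start) * 1000)

# what I should have done: plain wall clock, no profiler attached
start = time.time()
run_iterations()
torch.cuda.synchronize()
print("ms (no profiler):", (time.time() - start) * 1000)
```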

TLDR: No performance regression as far as I can tell, just don’t add the profiler overhead to your measurements. :sweat_smile: