.cuda slow on GTX Titan vs. GTX 1080 Ti

EDIT: This isn’t a docker issue, the slowdown is still there when I install outside the container.

Not much of a docker (or PyTorch) expert, but I have a text-to-speech codebase where I'm shoehorning a general transformation (Griffin-Lim) into a torch.nn.Module. On one machine it runs on the order of a second (ignoring startup). This code has been put into a docker image on a host with a faster CPU and GPUs, and I get an immense slowdown (1 s to 18 s).

I've profiled, and if one compares appropriately (I'm a little bothered that the number of calls isn't exactly the same), it looks like the culprit is the calls to the .cuda() method (3 s total on the fast machine vs. 21 s on the slow machine; output pasted at the end).
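For anyone wanting to reproduce this kind of output, here is a minimal sketch of how cProfile can produce a report sorted by internal time and restricted to the top entries, matching the format pasted below. The `main()` function here is a hypothetical stand-in for the actual synthesis call, which isn't shown in this post:

```python
import cProfile
import io
import pstats

def main():
    # Hypothetical stand-in for the real text-to-speech workload
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("tottime")  # reported as "internal time" in the header
stats.print_stats(15)        # "List reduced ... due to restriction <15>"
print(stream.getvalue())
```

The same sorting and restriction can also be done from the command line with `python -m cProfile -s tottime script.py`, though the top-N restriction then needs pstats.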

I'm currently using .cuda() as opposed to .to("cuda") / .to(device). Is there something obviously idiotic I'm doing that would result in such a slowdown in the docker image? The Python versions on the two machines differ slightly (fast machine = 3.6.3, slow machine = 3.6.7), but the profiler's environment summary matches:
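For what it's worth, .cuda() and .to(device) perform the same host-to-device copy, so switching between them shouldn't change the timing by itself. A minimal sketch of the .to(device) form (not taken from the codebase; the tensor here is a placeholder), with the synchronization needed to time GPU work meaningfully:

```python
import torch

# Pick the device once; fall back to CPU when no GPU is present
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(1024, 1024)
x_gpu = x.to(device)  # same copy as x.cuda() when device is a GPU

# Note: the first CUDA call in a process also pays for CUDA context
# initialization, and cProfile attributes that cost to whichever
# method happened to trigger it. GPU work is asynchronous, so to time
# a copy itself, synchronize before reading the clock.
if device.type == "cuda":
    torch.cuda.synchronize()
```

One caveat when reading cProfile numbers for .cuda(): because kernel launches are asynchronous, time spent waiting inside a transfer can include earlier GPU work that hadn't finished yet.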


Environment Summary

PyTorch 0.4.1 compiled w/ CUDA 9.0.176
Running with Python 3.6 and CUDA 9.0.176

Fast machine:

cProfile output

     1299411 function calls (1275111 primitive calls) in 17.041 seconds

Ordered by: internal time
List reduced from 7173 to 15 due to restriction <15>

ncalls tottime percall cumtime percall filename:lineno(function)
1 9.531 9.531 9.531 9.531 /lib/python3.6/site-packages/numpy/linalg/linalg.py:1299(svd)
324 3.486 0.011 3.486 0.011 {method 'cuda' of 'torch._C._TensorBase' objects}

Slow machine:

cProfile output

     1311902 function calls (1287710 primitive calls) in 28.400 seconds

Ordered by: internal time
List reduced from 7152 to 15 due to restriction <15>

ncalls tottime percall cumtime percall filename:lineno(function)
385 21.639 0.056 21.639 0.056 {method 'cuda' of 'torch._C._TensorBase' objects}
1 3.791 3.791 3.791 3.791 /usr/local/lib/python3.6/dist-packages/numpy/linalg/linalg.py:1299(svd)

So the performance gap persists when I compare outside the docker container; I didn't realize that the difference in hardware was the issue. The Titan is much older than the 1080 Ti.

GeForce GTX 1080 Ti
ncalls tottime percall cumtime percall filename:lineno(function)
324 8.589 0.027 8.589 0.027 {method 'cuda' of 'torch._C._TensorBase' objects}

GeForce GTX TITAN
ncalls tottime percall cumtime percall filename:lineno(function)
324 21.860 0.067 21.860 0.067 {method 'cuda' of 'torch._C._TensorBase' objects}