EDIT: This isn't a Docker issue; the slowdown is still there when I install outside the container.
Not much of a Docker (or PyTorch) expert, but I have a text-to-speech codebase where I am shoehorning a general transformation (Griffin-Lim) into a torch.nn.Module. On one machine it runs on the order of a second (ignoring startup). This code has been put into a Docker image on a host with a faster CPU and GPUs, and there I get an immense slowdown (1 s to 18 s).
I've profiled, and if one compares the outputs appropriately (though I'm a little bothered that the number of calls isn't exactly the same), the culprit looks like the calls to the .cuda() method: 3 s total on the fast machine vs. 21 s on the slow one (output pasted at the end).
I'm currently using .cuda() as opposed to .to("cuda") / .to(device). Is there something obviously idiotic I am doing that would result in such a slowdown in the Docker image? The Python versions on the two machines differ slightly (fast machine = 3.6.3, slow machine = 3.6.7), but the profiler's environment summary matches:
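In case it helps, this is the kind of minimal snippet I can run on both machines to time the transfers in isolation, outside of cProfile (the function name is my own; the torch.cuda.synchronize() calls are there because timing GPU work without syncing can be misleading, and the warm-up call keeps the one-time CUDA context initialization out of the average):

```python
import time
import torch

def mean_transfer_time(tensor, device, n=20):
    """Average wall-clock time to copy `tensor` to `device` over n runs."""
    tensor.to(device)  # warm-up: the first call may pay one-time init costs
    if device.type == "cuda":
        torch.cuda.synchronize()  # make sure the warm-up copy has finished
    start = time.perf_counter()
    for _ in range(n):
        tensor.to(device)
    if device.type == "cuda":
        torch.cuda.synchronize()  # don't stop the clock before the GPU is done
    return (time.perf_counter() - start) / n

x = torch.randn(1024, 1024)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("%.3f ms per transfer to %s" % (mean_transfer_time(x, device) * 1e3, device))
```

(Using .to(device) here so the same snippet also falls back to CPU on a machine without a GPU; with a CUDA device it measures the same host-to-device copy as .cuda().)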
Environment Summary
PyTorch 0.4.1 compiled w/ CUDA 9.0.176
Running with Python 3.6 and CUDA 9.0.176
Fast machine:
cProfile output
1299411 function calls (1275111 primitive calls) in 17.041 seconds
Ordered by: internal time
List reduced from 7173 to 15 due to restriction <15>
ncalls tottime percall cumtime percall filename:lineno(function)
1 9.531 9.531 9.531 9.531 /lib/python3.6/site-packages/numpy/linalg/linalg.py:1299(svd)
324 3.486 0.011 3.486 0.011 {method 'cuda' of 'torch._C._TensorBase' objects}
…
Slow machine:
cProfile output
1311902 function calls (1287710 primitive calls) in 28.400 seconds
Ordered by: internal time
List reduced from 7152 to 15 due to restriction <15>
ncalls tottime percall cumtime percall filename:lineno(function)
385 21.639 0.056 21.639 0.056 {method 'cuda' of 'torch._C._TensorBase' objects}
1 3.791 3.791 3.791 3.791 /usr/local/lib/python3.6/dist-packages/numpy/linalg/linalg.py:1299(svd)
…