Hey folks,
after upgrading to torch==2.0 yesterday, I found that I can no longer run torch programs on systems that don’t have CUDA.
Here’s my observation on the various distributions:
# PyTorch 2.0 for use with CUDA, CUDA libs get installed as pip deps if unavailable on the system. Wheel size w/o deps: ~620MB
pip3 install torch==2.0
# PyTorch 2.0 with bundled CUDA 11.7. Wheel size w/o deps: 1.8GB
pip3 install torch==2.0+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
# PyTorch 2.0 w/o CUDA support. Wheel size w/o deps: 195MB
pip3 install torch==2.0+cpu --extra-index-url https://download.pytorch.org/whl/cpu
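To see which CUDA pip deps a given wheel declares, the package metadata can be read without importing torch. This is only a minimal sketch: it assumes Python 3.8+ and that the CUDA pip deps use names starting with nvidia- (an assumption based on packages such as nvidia-cublas-cu11). cuda_pip_deps is a hypothetical helper name, not part of torch.

```python
from importlib import metadata


def cuda_pip_deps(requirements):
    """Return the requirement strings that look like NVIDIA CUDA packages.

    Assumption: CUDA pip deps use the "nvidia-" name prefix.
    """
    return [
        req for req in (requirements or [])
        if req.strip().lower().startswith("nvidia-")
    ]


if __name__ == "__main__":
    try:
        # requires() returns None when the distribution declares no deps
        declared = metadata.requires("torch")
    except metadata.PackageNotFoundError:
        print("torch is not installed in this environment")
    else:
        for req in cuda_pip_deps(declared):
            print(req)
```

Since this reads only the installed metadata, it should also work in an environment where import torch itself fails.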
If I now look at what got installed by the first option (in site-packages/torch/lib), I see, among other things:
-rwxrwxr-x 1 ubuntu ubuntu 487M Apr 11 08:33 libtorch_cpu.so
-rwxrwxr-x 1 ubuntu ubuntu 627M Apr 11 08:33 libtorch_cuda.so
so my expectation would be that this distribution allows me to use Torch with or without CUDA support.
However, in reality, import torch fails on a non-CUDA system (w/o the CUDA pip deps installed): ldd libtorch_global_deps.so shows that the global deps library, which is unconditionally loaded at package import time, is linked against a bunch of CUDA libraries (libcublas.so, libcurand.so and others), so it fails to load on a non-CUDA system.
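The ldd check can also be scripted so that it lists exactly which libraries fail to resolve, again without importing torch. This is a rough sketch that assumes a Linux system with ldd on the PATH and the site-packages/torch/lib layout described above. missing_libs and global_deps_path are hypothetical helper names.

```python
import importlib.util
import pathlib
import shutil
import subprocess


def missing_libs(ldd_output):
    """Return the names of libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


def global_deps_path():
    """Locate libtorch_global_deps.so without importing torch."""
    # find_spec on a top-level package only locates it; it does not run it
    spec = importlib.util.find_spec("torch")
    if spec is None or spec.origin is None:
        return None
    return pathlib.Path(spec.origin).parent / "lib" / "libtorch_global_deps.so"


if __name__ == "__main__":
    path = global_deps_path()
    if path is None or not path.exists():
        print("libtorch_global_deps.so not found")
    elif shutil.which("ldd") is None:
        print("ldd is not available on this system")
    else:
        result = subprocess.run(["ldd", str(path)], capture_output=True, text=True)
        for name in missing_libs(result.stdout):
            print("missing:", name)
```

On a system without CUDA, I would expect this to print the CUDA libraries mentioned above for the torch==2.0 wheel from PyPI.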
This is apparently different from the behavior in torch==1.11.0 (the previous version I was using). There, I also see
-rwxrwxr-x 1 ubuntu ubuntu 433M Apr 11 09:51 libtorch_cpu.so
-rwxrwxr-x 1 ubuntu ubuntu 994M Apr 11 09:50 libtorch_cuda.so
in the lib folder of the package, and I can indeed use CUDA on a CUDA system, but libtorch_global_deps.so does not link against any CUDA libraries:
$ ldd venv/lib/python3.9/site-packages/torch/lib/libtorch_global_deps.so | grep cuda
$
Does anyone have any insight into why this change was made? It makes it much harder to use a consistent set of dependencies across systems and architectures.
Now, this doesn’t matter if installing the right version of torch is regarded as a responsibility of the system, but in our case, we use bazel as our toolchain and thus require some level of hermetic homogeneity between building our production containers (using a base image with CUDA system libs) and, e.g., running basic functional tests in CI (on runners that don’t have GPUs and where installing CUDA is a waste of time and space).
Concretely, we don’t want to install over a GB of CUDA libraries as a Python dependency, because shipping them in a base image layer is more efficient. And we don’t want to install CUDA in the image used for our CI runner, because it will never have GPUs. But we do want to be able to run bazel test on a GPU-enabled, CUDA-enabled Linux system and have Torch use CUDA, without bazel resolving dependencies differently from CI under the hood.
I guess we could somehow address this at the Bazel level if there’s no other way, but why do we now have to, when in torch 1.11 we didn’t, while still enjoying CUDA support?
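For anyone who wants to reproduce this on their own runners, a small probe that tries the import in a child process shows whether a given environment is affected, without crashing the calling script. This is just a sketch. can_import is a hypothetical helper name, and it only reports success or failure, not the reason.

```python
import subprocess
import sys


def can_import(module, python=sys.executable):
    """Return True if `python -c "import <module>"` exits with status 0."""
    result = subprocess.run(
        [python, "-c", f"import {module}"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


if __name__ == "__main__":
    print("torch importable:", can_import("torch"))
```

Running the import in a subprocess keeps the check independent of which exception type the failed library load ends up raising.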