Why is torch-1.10.0+cu113 much larger than torch-1.10.0?

I am deploying our ML models in a Docker container, and we are trying to reduce the size of the Docker image. I noticed that torch-1.10.0+cu113 is more than 1GB larger than torch-1.10.0. The main difference comes from torch/lib/libtorch_cuda_cpp.so, which only exists in torch-1.10.0+cu113. What is this file used for, and can we leave it out and still have torch work with CUDA 11.3?


The GPU kernels are separate in PyTorch precisely for the reason you’re describing: they add a substantial amount of footprint to the library size. This file contains all the CUDA kernels for GPU execution; without them there wouldn’t be any native CUDA support.

Thanks for your reply! But torch-1.10.0 also has GPU support (for cuda 10.2)

In torch-1.10.0+cu113, there are:

libtorch_cuda.so 
libtorch_cuda_cpp.so
libtorch_cpu.so

In torch-1.10.0 (which works for cu102), there are:

libtorch_cuda.so 
libtorch_cpu.so

So CUDA kernels are contained in libtorch_cuda.so. So what is the difference between libtorch_cuda.so and libtorch_cuda_cpp.so?
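For anyone who wants to reproduce this comparison, here is a minimal sketch that lists the shared libraries in torch/lib by size. The helper names (shared_lib_sizes, torch_lib_dir) are my own and not part of PyTorch, and it assumes the usual pip wheel layout with the libraries under torch/lib:

```python
import importlib.util
from pathlib import Path


def shared_lib_sizes(lib_dir):
    """Return (file name, size in bytes) for each .so file, largest first."""
    libs = [
        (p.name, p.stat().st_size)
        for p in Path(lib_dir).glob("*.so*")
        if p.is_file()
    ]
    return sorted(libs, key=lambda item: item[1], reverse=True)


def torch_lib_dir():
    """Locate torch/lib without importing torch; None if torch is not installed."""
    spec = importlib.util.find_spec("torch")
    if spec is None or not spec.submodule_search_locations:
        return None
    # Assumption: the wheel keeps its shared libraries in <package>/lib.
    return Path(list(spec.submodule_search_locations)[0]) / "lib"


if __name__ == "__main__":
    lib_dir = torch_lib_dir()
    if lib_dir is not None and lib_dir.is_dir():
        for name, size in shared_lib_sizes(lib_dir):
            print(f"{size / 2**20:10.1f} MiB  {name}")
```

Running it in each environment (cu102 and cu113) should make it easy to see which files account for the size difference.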

Ah, sorry, I misread your question. I think the difference might be that CUDA 11 supports more GPU architectures, so the build for the newer CUDA version ships corresponding kernels for the newer architectures.

As for the file split, I think that might just be an artifact of a tweak to the build process, but I’m not very knowledgeable about the details here.

Thank you for your reply! libtorch_cuda_cpp.so is taking more than 2.5GB of space. I tried deleting it and so far the code runs without any problem. Very interesting.

(see more potential details here):

It might be that you’re not hitting any of the functions defined there…

Since torch-1.10.0 with cuda10.2 doesn’t have libtorch_cuda_cpp.so and still works fine on GPU, I suppose libtorch_cuda_cpp.so is only needed by caffe2.

That doesn’t sound quite right… if you take a look at the file, it goes over source files in ATen, so that warrants some caution.

No, the caffe2 bits are built in libcaffe2_...so. The library was split to avoid relocation issues during linking, which were caused by the growth in library size from CUDA, cuDNN, etc.
libtorch_cuda_cpp.so contains symbols for cuDNN, NCCL, etc. Also, if I delete it, I’m unable to import torch and get:

ImportError: libtorch_cuda_cpp.so: cannot open shared object file: No such file or directory

so I’m unsure what your workflow is.
TL;DR: don’t delete it, as this lib is needed.
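If you want to check this in your own environment, here is a rough sketch that tries the import in a fresh interpreter, so that an already-loaded copy of the library in the current process can’t mask a missing file. import_check is a hypothetical helper name, not a PyTorch API:

```python
import subprocess
import sys


def import_check(module_name):
    """Try to import module_name in a fresh interpreter; return (ok, stderr)."""
    result = subprocess.run(
        [sys.executable, "-c", f"import {module_name}"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stderr.strip()


if __name__ == "__main__":
    ok, err = import_check("torch")
    print("import torch:", "ok" if ok else f"failed\n{err}")
```

A successful import alone doesn’t prove every code path works, so it is worth also running your real GPU workload before trimming anything from the image.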

Not sure whether this adds any value, but for questions about which library uses which in PyTorch, it might help to look at how PyTorch gets installed from source. See “I cannot use the pytorch that was built successfully from source: (DLL) initialization routine failed. Error loading caffe2_detectron_ops_gpu.dll”. At least the same players take part there. Just a side note; I guess it may not be useful in the end.