Why is torch-1.10.0+cu113 much larger than torch-1.10.0?

ralphmao · November 9, 2021, 5:55pm

I am deploying our ML models in a Docker container, and we are trying to reduce the size of the Docker image. I noticed that torch-1.10.0+cu113 is more than 1GB larger than torch-1.10.0. The main difference comes from torch/lib/libtorch_cuda_cpp.so, which only exists in torch-1.10.0+cu113. What is this file used for, and can we drop it and still have things work with CUDA 11.3?

eqy · November 9, 2021, 5:59pm

The GPU kernels are separate in PyTorch precisely for the reason you’re describing: they add a substantial amount of footprint to the library size. This file contains all the CUDA kernels for GPU execution; without them there wouldn’t be any native CUDA support.

ralphmao · November 9, 2021, 6:02pm

Thanks for your reply! But torch-1.10.0 also has GPU support (for CUDA 10.2).

ralphmao · November 9, 2021, 6:04pm

In torch-1.10.0+cu113, there are:

libtorch_cuda.so 
libtorch_cuda_cpp.so
libtorch_cpu.so

In torch-1.10.0 (which works for cu102), there are:

libtorch_cuda.so 
libtorch_cpu.so

So the CUDA kernels are contained in libtorch_cuda.so. What, then, is the difference between libtorch_cuda.so and libtorch_cuda_cpp.so?
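
A listing like the ones above can be reproduced with a short script. This is a minimal sketch, assuming a Linux install where the libraries sit in torch/lib next to the package's `__init__.py`; the helper names `list_shared_libs` and `torch_lib_dir` are made up for illustration and are not part of PyTorch.

```python
from importlib.util import find_spec
from pathlib import Path


def list_shared_libs(lib_dir):
    """Return (file name, size in MiB) for each shared library in lib_dir, largest first."""
    libs = [p for p in Path(lib_dir).iterdir() if p.is_file() and ".so" in p.name]
    sizes = [(p.name, p.stat().st_size / 2**20) for p in libs]
    return sorted(sizes, key=lambda item: item[1], reverse=True)


def torch_lib_dir():
    """Locate torch/lib without importing torch; returns None if torch is not installed."""
    spec = find_spec("torch")
    if spec is None or spec.origin is None:
        return None
    # Assumption: the shared libraries live in a "lib" folder beside torch/__init__.py
    return Path(spec.origin).parent / "lib"


if __name__ == "__main__":
    lib_dir = torch_lib_dir()
    if lib_dir is None or not lib_dir.is_dir():
        print("torch/lib not found in this environment")
    else:
        for name, mib in list_shared_libs(lib_dir):
            print(f"{mib:10.1f} MiB  {name}")
```

Running it in both the +cu113 and the default environment makes it easy to compare which files account for the size difference.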

eqy · November 9, 2021, 6:07pm

Ah, sorry, I misread your question. I think the difference might be that CUDA 11 supports more GPU architectures, so the build for the newer CUDA version includes corresponding kernels for the newer architectures.

As for the file split, I think that might just be an artifact of a tweak to the build process, but I’m not very knowledgeable about the details here.

github.com/pytorch/pytorch/blob/9ae3f3945b519c04ef0aa4f9e441be6401292f86/CMakeLists.txt#L181

option(BUILD_JNI "Build JNI bindings" OFF)
option(BUILD_MOBILE_AUTOGRAD "Build autograd function in mobile build (in development)" OFF)
cmake_dependent_option(
    INSTALL_TEST "Install test binaries if BUILD_TEST is on" ON
    "BUILD_TEST" OFF)
option(USE_CPP_CODE_COVERAGE "Compile C/C++ with code coverage flags" OFF)
option(COLORIZE_OUTPUT "Colorize output during compilation" ON)
option(USE_ASAN "Use Address Sanitizer" OFF)
option(USE_TSAN "Use Thread Sanitizer" OFF)
option(USE_CUDA "Use CUDA" ON)
# BUILD_SPLIT_CUDA must also be exported as an environment variable before building, with
# `export BUILD_SPLIT_CUDA=1` because cpp_extension.py can only work properly if this variable
# also exists in the environment.
# This option is incompatible with CUDA_SEPARABLE_COMPILATION.
cmake_dependent_option(
    BUILD_SPLIT_CUDA "Split torch_cuda library into torch_cuda_cu and torch_cuda_cpp" OFF
    "USE_CUDA AND NOT CUDA_SEPARABLE_COMPILATION" OFF)
option(USE_FAST_NVCC "Use parallel NVCC build" OFF)
option(USE_ROCM "Use ROCm" ON)
option(CAFFE2_STATIC_LINK_CUDA "Statically link CUDA libraries" OFF)
cmake_dependent_option(
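
On the architecture point raised in this reply: one way to see which GPU architectures an installed build was compiled for is `torch.cuda.get_arch_list()`. The sketch below is a hedged illustration; the wrapper name `compiled_cuda_archs` is hypothetical, and the import is guarded so the script also runs where torch is not installed.

```python
def compiled_cuda_archs():
    """Return the CUDA architectures this torch build was compiled for.

    Returns None when torch is not installed. On an installed build with
    no usable CUDA device, torch reports an empty list.
    """
    try:
        import torch
    except ImportError:
        return None
    return torch.cuda.get_arch_list()


if __name__ == "__main__":
    archs = compiled_cuda_archs()
    if archs is None:
        print("torch is not installed in this environment")
    else:
        print("compiled architectures:", archs)
```

Comparing the result between the +cu113 and the default wheel would show whether the larger one really carries kernels for more architectures.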

ralphmao · November 9, 2021, 6:09pm

Thank you for your reply! libtorch_cuda_cpp.so is taking more than 2.5GB of space. I tried deleting it and so far the code runs without any problem. Very interesting.

eqy · November 9, 2021, 6:13pm

There may be more details here:

github.com/pytorch/pytorch/blob/4262c8913c2bddb8d91565888b4871790301faba/caffe2/CMakeLists.txt#L177

endif()

# Advanced: if we have allow list specified, we will do intersections for all
# main lib srcs.
if(CAFFE2_ALLOWLISTED_FILES)
  caffe2_do_allowlist(Caffe2_CPU_SRCS CAFFE2_ALLOWLISTED_FILES)
  caffe2_do_allowlist(Caffe2_GPU_SRCS CAFFE2_ALLOWLISTED_FILES)
  caffe2_do_allowlist(Caffe2_HIP_SRCS CAFFE2_ALLOWLISTED_FILES)
endif()

if(BUILD_SPLIT_CUDA)
  # Splitting the source files that'll be in torch_cuda between torch_cuda_cu and torch_cuda_cpp
  foreach(tmp ${Caffe2_GPU_SRCS})
    if("${tmp}" MATCHES "(.*aten.*\\.cu|.*(b|B)las.*|.*((s|S)olver|Register.*CUDA|Legacy|THC|TensorShapeCUDA|BatchLinearAlgebra|ReduceOps|Equal|Activation|ScanKernels|Sort|TensorTopK).*\\.cpp)" AND NOT "${tmp}" MATCHES ".*(THC((CachingHost)?Allocator|General)).*")
      # Currently, torch_cuda_cu will have all the .cu files in aten, as well as some others that depend on those files
      list(APPEND Caffe2_GPU_SRCS_CU ${tmp})
    else()
      list(APPEND Caffe2_GPU_SRCS_CPP ${tmp})
    endif()
  endforeach()

It might be that you’re not hitting any of the functions defined there…
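
The split rule in the CMake excerpt above can be approximated in Python to see where a given source file would land. This is only a sketch: the two patterns are copied from the excerpt, Python's `re` is assumed to treat them the same way CMake's `MATCHES` does for these simple expressions, and the function name and sample paths are illustrative rather than taken from the actual build.

```python
import re

# Patterns copied from the BUILD_SPLIT_CUDA block (CMake "\\." becomes "\." here)
CU_PATTERN = re.compile(
    r"(.*aten.*\.cu|.*(b|B)las.*|.*((s|S)olver|Register.*CUDA|Legacy|THC"
    r"|TensorShapeCUDA|BatchLinearAlgebra|ReduceOps|Equal|Activation"
    r"|ScanKernels|Sort|TensorTopK).*\.cpp)"
)
EXCLUDE_PATTERN = re.compile(r".*(THC((CachingHost)?Allocator|General)).*")


def split_target(path):
    """Return which library a GPU source file would go to under the rule above."""
    # CMake's MATCHES succeeds on a match anywhere in the string, like re.search
    if CU_PATTERN.search(path) and not EXCLUDE_PATTERN.search(path):
        return "torch_cuda_cu"
    return "torch_cuda_cpp"


if __name__ == "__main__":
    # Illustrative paths only, not an authoritative list of PyTorch sources
    for path in [
        "aten/src/ATen/native/cuda/Activation.cu",
        "aten/src/ATen/cuda/CUDABlas.cpp",
        "aten/src/THC/THCCachingHostAllocator.cpp",
        "aten/src/ATen/cudnn/Descriptors.cpp",
        "torch/csrc/cuda/nccl.cpp",
    ]:
        print(f"{split_target(path):15s} {path}")
```

Under this rule the .cu kernel files in aten go to torch_cuda_cu, while the remaining GPU-related .cpp sources fall through to torch_cuda_cpp.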

ralphmao · November 9, 2021, 6:18pm

Since torch-1.10.0 with CUDA 10.2 doesn’t have libtorch_cuda_cpp.so and still works fine on GPU, I suppose libtorch_cuda_cpp.so is only needed by caffe2.

eqy · November 9, 2021, 6:33pm

That doesn’t sound quite right… if you take a look at the file, it’s iterating over source files in aten, so that warrants some caution.

ptrblck · November 9, 2021, 9:01pm

No, the caffe2 bits are built in libcaffe2_...so. The library was split to avoid relocation issues during linking, which were caused by the library growing too large as CUDA, cuDNN, etc. increased in size.
libtorch_cuda_cpp.so contains symbols for cuDNN, NCCL, etc. Also, if I delete it, I’m unable to import torch and get:

ImportError: libtorch_cuda_cpp.so: cannot open shared object file: No such file or directory

so I’m unsure what your workflow is.
TL;DR: don’t delete it, as this lib is needed.
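
Before removing any file from an image, a quick smoke test in a fresh interpreter can catch this kind of ImportError. The following is a minimal sketch; the helper name `import_succeeds` is hypothetical, and it only checks that the import itself works.

```python
import subprocess
import sys


def import_succeeds(module_name):
    """Try `import module_name` in a fresh interpreter; return (ok, stderr text)."""
    proc = subprocess.run(
        [sys.executable, "-c", f"import {module_name}"],
        capture_output=True,
        text=True,
    )
    return proc.returncode == 0, proc.stderr


if __name__ == "__main__":
    ok, err = import_succeeds("torch")
    print("import torch:", "ok" if ok else "FAILED")
    if not ok:
        # Show only the last stderr line, which usually carries the error message
        print(err.strip().splitlines()[-1] if err.strip() else "(no stderr)")
```

A passing import does not prove that every GPU code path still works, so running the actual workload on the trimmed image remains the more meaningful check.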

lorenzznerol · November 15, 2021, 9:12pm

Not sure whether this adds any value, but when the question is which library uses which in PyTorch, it might help to look at how PyTorch gets installed from source. See the thread “I cannot use the pytorch that was built successfully from source: (DLL) initialization routine failed. Error loading caffe2_detectron_ops_gpu.dll.” At least the same players take part there. This is just a side note, and it may not turn out to be useful in the end.