We use Bazel as part of a large monorepo to integrate with Torch. To support a large number of concurrent builds, we must execute our build actions remotely, and this entails serializing the files each build action needs. However, protobuf serialization has a hard 2GB limit, which makes libtorch_cuda.so unusable remotely, which in turn prevents many actions in our repository from executing remotely. Running these actions locally would be very bad for build performance, so I was looking for ways to shrink libtorch_cuda.so.
I tried stripping out debug symbols:
strip --strip-debug libtorch_cuda.so
strip --strip-unneeded libtorch_cuda.so
Neither had any effect.
It should be possible to shrink it, as we do stay under the PyPI maximum package size; you can see in the release that our CUDA binaries are around 700MB.
You might want to make sure that CUDA architecture detection works correctly and compile only for the architectures that are relevant for you. You can use TORCH_CUDA_ARCH_LIST to force specific architectures; if it compiles for all of them, the binary will be very large.
Also, building with DEBUG=0 will avoid all debug symbols.
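For a source build, both knobs are plain environment variables set before invoking setup.py. A sketch, where "7.5+PTX" is just an example value; pick the archs your GPUs actually need:

```shell
# Example source-build invocation: one architecture only, debug off.
# "7.5+PTX" is illustrative; substitute your own arch list.
TORCH_CUDA_ARCH_LIST="7.5+PTX" DEBUG=0 python setup.py install
```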
Do these help?
DEBUG=0 did not make a difference for our build; perhaps it was already off by default.
Our TORCH_CUDA_ARCH_LIST is "5.2;6.1;7.0;7.5+PTX".
As an experiment, I removed 5.2 and the size went from 2.5GB to 2.4GB; removing 7.0 as well brought it to 2.3GB.
I did notice that the CUDA libraries got much larger between CUDA 10.2 and 11, which is what revealed the size issue in the first place. But I’ll need to see if NVIDIA has anything to say about that.
Our source build with TORCH_CUDA_ARCH_LIST="5.2 6.0 6.1 7.0 7.5 8.0 8.6+PTX" creates a libtorch_cuda.so of 1269MB with CUDA 11.1, so I’m unsure why your build creates a bigger lib with fewer compute capabilities.
We are currently working internally to provide CUDA and cuDNN conda and pip wheels, which would allow the binaries to link these runtimes dynamically. However, this would not solve your local build issue.
I see the following:
cu102/torch-1.7.0-cp37-cp37m-linux_x86_64.whl -> torch/lib/libtorch_cuda.so: 1.1GB
cu110/torch-1.7.0%2Bcu110-cp37-cp37m-linux_x86_64.whl -> torch/lib/libtorch_cuda.so: 2.2GB
So something caused this artifact to double in size between CUDA 10.2 and 11, and from what I can tell from comparing CUDA releases, it may just be CUDA itself.
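For anyone reproducing these numbers: a wheel is just a zip archive, so the stored size of libtorch_cuda.so can be read without installing anything. A self-contained sketch using a 1 KiB stand-in file (the real check would point `-l` at the downloaded .whl):

```shell
# A .whl is a zip; list member sizes to see how big libtorch_cuda.so is.
# Demo uses a tiny stand-in instead of the real 1-2 GB wheel.
head -c 1024 /dev/zero > libtorch_cuda.so
python3 -m zipfile -c demo.whl libtorch_cuda.so
python3 -m zipfile -l demo.whl   # columns: File Name, Modified, Size
```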
I also tried passing "-Oz" via CXXFLAGS/CFLAGS to optimize for size, but this seems to have had no effect either.
Here are some of the other options we use during local build:
Can you share the script, or the general logic, of how you are building this? I can compare it to how we are building and hopefully spot the differences.
At the beginning of the build process, during the CMake setup, there should be a line that lists the CUDA architectures it will use and why. Could you check that? You should be able to grep for CUDA in the build logs.
I see the CUDA NVCC flags chosen in the CMake log, and they look correct for TORCH_CUDA_ARCH_LIST="5.2;6.1;7.0;7.5+PTX":
-- Found CUDA: /usr/local/cuda (found version "11.0")
-- Caffe2: CUDA detected: 11.0
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 11.0
-- Found CUDNN: /usr/local/cuda/lib64/libcudnn_static.a
-- Found cuDNN: v8.0.4 (include: /usr/local/cuda/include, library: /usr/local/cuda/lib64/libcudnn_static.a)
CMake Warning at cmake/public/utils.cmake:196 (message):
In the future we will require one to explicitly pass TORCH_CUDA_ARCH_LIST
to cmake instead of implicitly setting it as an env variable. This will
become a FATAL_ERROR in future version of pytorch.
Call Stack (most recent call first):
cmake/public/cuda.cmake:452 (torch_cuda_get_nvcc_gencode_flag)
cmake/Dependencies.cmake:1097 (include)
CMakeLists.txt:469 (include)
-- Added CUDA NVCC flags for: -gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_75,code=compute_75
CMake Warning at cmake/External/nccl.cmake:65 (message):
Objcopy version is too old to support NCCL library slimming
Call Stack (most recent call first):
cmake/Dependencies.cmake:1219 (include)
CMakeLists.txt:469 (include)
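For reference, the mapping from TORCH_CUDA_ARCH_LIST to the gencode flags in that log line is mechanical: each arch X.Y yields code=sm_XY, and a trailing +PTX additionally yields code=compute_XY. A rough POSIX-sh sketch of the expansion (not PyTorch's actual CMake logic, just the same rule):

```shell
# Sketch: expand a TORCH_CUDA_ARCH_LIST value into nvcc -gencode flags.
# Each "X.Y" gives code=sm_XY; a "+PTX" suffix also gives code=compute_XY.
archlist="5.2;6.1;7.0;7.5+PTX"
for a in $(printf '%s' "$archlist" | tr ';' ' '); do
  base=${a%+PTX}                              # drop any +PTX suffix
  num=$(printf '%s' "$base" | tr -d '.')      # "7.5" -> "75"
  printf -- '-gencode;arch=compute_%s,code=sm_%s;' "$num" "$num"
  if [ "$a" != "$base" ]; then                # +PTX was present
    printf -- '-gencode;arch=compute_%s,code=compute_%s;' "$num" "$num"
  fi
done
echo
```

This reproduces the "-- Added CUDA NVCC flags for:" line above, modulo the trailing separator, which is a useful cross-check that the arch list really was honored.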
If I figure out what I’m doing wrong, would you expect ./cuobjdump libtorch_cuda.so -lelf | awk -F. '{print $3}' | sort -u
to show fewer results?
I don’t have a point of reference other than the public wheel, which was specifically built for all architectures. But I’m working on trying to get those symbols out of the shared object. Does that goal make sense, or will they be there regardless?
Do you think there’s a script I can use in https://github.com/pytorch/builder that will produce exactly the same public wheel but only for the architectures we care about? Maybe it would be easier to use something closer to your CI.
The other mystery still remains. What made the public wheel double in size?
cu102/torch-1.7.0-cp37-cp37m-linux_x86_64.whl -> torch/lib/libtorch_cuda.so: 1.1GB
cu110/torch-1.7.0%2Bcu110-cp37-cp37m-linux_x86_64.whl -> torch/lib/libtorch_cuda.so: 2.2GB
I think the new archs account for roughly 300MB, but I don’t know what explains the other 700MB.
cc @seemethere, who knows the builder repo quite well. He will be able to help you with that!