Libtorch_cuda.so is too large (>2GB)

We use Bazel as part of a large monorepo to integrate with torch. To support a large number of concurrent builds, we must execute our build actions remotely, which entails serializing the files needed for each build action. However, protobuf serialization has a hard limit of 2GB, which makes libtorch_cuda.so unusable remotely and in turn prevents many actions in our repository from executing remotely. Running these actions locally would be very bad for build performance, so I was looking for ways to shrink libtorch_cuda.so.

I tried stripping out debug symbols:
strip --strip-debug libtorch_cuda.so
strip --strip-unneeded libtorch_cuda.so
Neither had any effect on the size.
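(Side note: a quick way to check which ELF sections actually dominate, and therefore whether debug info or embedded GPU code is the bulk, is a sketch like the following using binutils.)

# list the largest sections; CUDA device code typically lives in .nv_fatbin, debug info in .debug_*
size -A libtorch_cuda.so | sort -k2 -nr | head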

Do you have any suggestions?

we got the wheels from:
https://download.pytorch.org/whl/torch_stable.html

We also build from source, but that libtorch_cuda.so ended up being 2.4GB. I'm not sure if there are better compiler flags / env variables to make it smaller.

Hi,

It should be possible to shrink it, as we do meet the PyPI max package size. You can see in the releases that our CUDA binaries are around 700MB.

You might want to make sure that the CUDA architecture detection works fine and that you only compile for the architectures that are relevant for you. You can use TORCH_CUDA_ARCH_LIST to force specific architectures. If it compiles for all of them, that will make the binary very large.
Also, building with DEBUG=0 will avoid all debug symbols.
Do these help?
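(For example, a minimal sketch of a rebuild with both settings; the single 7.5 is just a placeholder for whichever GPU generations you actually need:)

DEBUG=0 TORCH_CUDA_ARCH_LIST="7.5" python setup.py bdist_wheel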


Good tips! I'll do a rebuild with these env variables and see if we get smaller artifacts.

DEBUG=0 did not make a difference for our build; perhaps it was already off by default.
Our TORCH_CUDA_ARCH_LIST is "5.2;6.1;7.0;7.5+PTX"

As an experiment, I removed 5.2 and the size went from 2.5GB to 2.4GB.
Then I removed 7.0, which brought it down to 2.3GB.

I did notice that the CUDA libraries got much larger between CUDA 10.2 and 11, which is what revealed the size issues. I'll need to see if NVIDIA has anything to say about that.

Our source build with TORCH_CUDA_ARCH_LIST="5.2 6.0 6.1 7.0 7.5 8.0 8.6+PTX" creates a libtorch_cuda.so of 1269MB with CUDA 11.1, so I’m unsure why your build creates a bigger lib with fewer compute capabilities.

We are currently working internally to provide CUDA and cuDNN conda and pip wheels, which would allow the binaries to link these runtimes dynamically. However, this would not solve your local build issue.

Thank you for the extra data point.

Looking at the already-published wheels: https://download.pytorch.org/whl/torch_stable.html

I see the following:
cu102/torch-1.7.0-cp37-cp37m-linux_x86_64.whl -> torch/lib/libtorch_cuda.so: 1.1GB
cu110/torch-1.7.0%2Bcu110-cp37-cp37m-linux_x86_64.whl -> torch/lib/libtorch_cuda.so: 2.2GB

So something caused this artifact to double in size between CUDA 10.2 and 11.
From what I can tell comparing CUDA releases, it may just be CUDA itself.

I also tried using CXXFLAGS/CFLAGS with "-Oz" to optimize for size, but this seems to have had no effect.
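(That would be consistent with most of the size being embedded device code: CXXFLAGS/CFLAGS only reach the host compiler, while nvcc gets its flags separately, e.g. through TORCH_NVCC_FLAGS as in the options below:)

export TORCH_NVCC_FLAGS="-Xfatbin -compress-all"   # compress the embedded fatbin sections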

Here are some of the other options we use during local build:

export EXTRA_CAFFE2_CMAKE_FLAGS=("-DATEN_NO_TEST=ON")
export TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
export TORCH_CUDA_ARCH_LIST="6.1;7.5+PTX"
export TH_BINARY_BUILD=1
export CMAKE_LIBRARY_PATH="/opt/intel/lib:/lib"
export CMAKE_INCLUDE_PATH="/opt/intel/include"
export DEBUG=0
export CXXFLAGS="-Oz"
export CFLAGS="-Oz"
export USE_CUDA=1
export USE_NNPACK=0
export USE_QNNPACK=0
export USE_FBGEMM=1
export USE_OPENMP=0
export USE_TBB=1
export USE_GFLAGS=0
export USE_GLOG=0
export BLAS=MKL
export ATEN_THREADING=TBB
export MKL_THREADING=TBB
export MKLDNN_CPU_RUNTIME=TBB
export PARALLEL_BACKEND=NATIVE_TBB
export CC=/clang_9.0.0/bin/clang
export CXX=/clang_9.0.0/bin/clang++
export PATCHELF_BIN=/usr/local/bin/patchelf
export VERBOSE=1

Maybe “fatbin” is a problem. Going to try that next.

NVIDIA suggested we use nvprune, but that doesn’t work if the object is not relocatable.

nvprune --arch sm_35 torch/lib/libtorch_cuda.so -o ~/Desktop/prune_test.so 
nvprune fatal   : Input file 'torch/lib/libtorch_cuda.so' not relocatable
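(For reference, nvprune only operates on relocatable device code, i.e. host objects or static libraries whose CUDA files were compiled with -rdc=true; a fully linked shared object like libtorch_cuda.so contains final device code, which is presumably why it gets rejected. A hypothetical example of input nvprune would accept, with kernels.cu as a placeholder:)

nvcc --relocatable-device-code=true -c kernels.cu -o kernels.o   # relocatable device code
nvprune --arch sm_75 kernels.o -o kernels_sm75.o                 # keep only sm_75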

Our specific goal would be to prune pytorch’s distributed whl to get it < 2GB.
Here is what that looks like:

# get
wget https://download.pytorch.org/whl/cu110/torch-1.7.0%2Bcu110-cp37-cp37m-linux_x86_64.whl
unzip torch-1.7.0*cu110*.whl

# see info
du -h torch/lib/libtorch_cuda.so
2.1G torch/lib/libtorch_cuda.so

file torch/lib/libtorch_cuda.so
torch/lib/libtorch_cuda.so: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, BuildID[sha1]=e88b68dc083cd4dc547483c896d95ef6bedd64fc, with debug_info, not stripped

# see info after stripping
strip torch/lib/libtorch_cuda.so
file torch/lib/libtorch_cuda.so
torch/lib/libtorch_cuda.so: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, BuildID[sha1]=e88b68dc083cd4dc547483c896d95ef6bedd64fc, stripped

du -h torch/lib/libtorch_cuda.so
2.0G torch/lib/libtorch_cuda.so

# try prune
nvprune torch/lib/libtorch_cuda.so -arch sm_35 -o shrink_me.so
nvprune fatal : Input file 'torch/lib/libtorch_cuda.so' not relocatable
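(One way to attribute the size per architecture without nvprune: cuobjdump can extract the embedded device ELFs, which can then be summed by their sm_* suffix. A rough sketch, run from the directory containing the unzipped wheel; sm_80 is just an example architecture:)

mkdir cubins && cd cubins
cuobjdump -xelf all ../torch/lib/libtorch_cuda.so   # extracts every embedded cubin as a file
du -ch *.sm_80.cubin | tail -1                      # total device code for that architecture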

Can you share the script or the general logic of how you are building this? I can compare it to how we are building and hopefully spot the differences.

I’m building PyTorch via:

TORCH_CUDA_ARCH_LIST="5.2 6.0 6.1 7.0 7.5 8.0 8.6+PTX" \
CUDA_HOME="/usr/local/cuda" \
CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
NCCL_INCLUDE_DIR="/usr/include/" \
NCCL_LIB_DIR="/usr/lib/" \
USE_SYSTEM_NCCL=1 \
python setup.py develop
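(After a develop build the library ends up under torch/lib in the source tree, so its size and the architectures it contains are easy to check, e.g.:)

du -h torch/lib/libtorch_cuda.so
cuobjdump torch/lib/libtorch_cuda.so -lelf | awk -F. '{print $3}' | sort -u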

It seems like TORCH_CUDA_ARCH_LIST is being ignored for me when I create the wheel:

I run:

  python3 setup.py clean --all
  time USE_STATIC_CUDNN=1 USE_STATIC_NCCL=1 ATEN_STATIC_CUDA=1 USE_CUDA_STATIC_LINK=1 \
       TORCH_CUDA_ARCH_LIST="5.2;6.1;7.0;7.5+PTX" \
       CXXFLAGS="-stdlib=libstdc++" \
       python3 setup.py bdist_wheel -d "${BUILD_DIR}"

but the results show:

./cuobjdump libtorch_cuda.so -lelf | awk -F. '{print $3}' | sort -u
cubin
sm_35
sm_37
sm_50
sm_52
sm_60
sm_61
sm_70
sm_75
sm_80

I also got an interesting warning. Is this relevant?

CMake Warning at cmake/External/nccl.cmake:65 (message):
Objcopy version is too old to support NCCL library slimming

Maybe these architectures show up because they already exist in the CUDA objects?
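(That hypothesis can be tested directly: when cuDNN/NCCL are linked statically, as with USE_STATIC_CUDNN=1 above, whatever architectures their prebuilt fatbins contain will typically end up embedded in libtorch_cuda.so regardless of TORCH_CUDA_ARCH_LIST. A rough check, with the path only as an example; point it at whichever libcudnn_static.a the build actually links:)

cuobjdump /usr/local/cuda/lib64/libcudnn_static.a -lelf | grep -o "sm_[0-9]*" | sort -u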

@albanD
are all of these sm_* entries normal given what I passed to TORCH_CUDA_ARCH_LIST?

At the beginning of the build process, during the cmake setup, you should see a line that lists the CUDA architectures it will be using and why. Could you check that? You should be able to grep for CUDA in the build logs.
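(Something along these lines, assuming the configure output was captured to a file; build.log here is just a placeholder name:)

grep -iE "TORCH_CUDA_ARCH_LIST|NVCC flags|Caffe2: CUDA" build.log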

I see the CUDA NVCC flags chosen in the cmake log. They look correct for:
TORCH_CUDA_ARCH_LIST="5.2;6.1;7.0;7.5+PTX"

-- Found CUDA: /usr/local/cuda (found version "11.0") 
-- Caffe2: CUDA detected: 11.0
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 11.0
-- Found CUDNN: /usr/local/cuda/lib64/libcudnn_static.a  
-- Found cuDNN: v8.0.4  (include: /usr/local/cuda/include, library: /usr/local/cuda/lib64/libcudnn_static.a)
CMake Warning at cmake/public/utils.cmake:196 (message):
  In the future we will require one to explicitly pass TORCH_CUDA_ARCH_LIST
  to cmake instead of implicitly setting it as an env variable.  This will
  become a FATAL_ERROR in future version of pytorch.
Call Stack (most recent call first):
  cmake/public/cuda.cmake:452 (torch_cuda_get_nvcc_gencode_flag)
  cmake/Dependencies.cmake:1097 (include)
  CMakeLists.txt:469 (include)


-- Added CUDA NVCC flags for: -gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_75,code=compute_75
CMake Warning at cmake/External/nccl.cmake:65 (message):
  Objcopy version is too old to support NCCL library slimming
Call Stack (most recent call first):
  cmake/Dependencies.cmake:1219 (include)
  CMakeLists.txt:469 (include)

Ok so it does look like it picks up these arguments just fine. Not sure what the issue is to be honest…

If I figure out what I’m doing wrong, would you expect
./cuobjdump libtorch_cuda.so -lelf | awk -F. '{print $3}' | sort -u
to show fewer results?

I don’t have a point of reference other than the public wheel which was specifically built for all architectures. But I’m working on trying to get those symbols out of the shared object. Does that goal make sense, or are they going to be there regardless?

Do you think there’s a script I can use in https://github.com/pytorch/builder that will produce exactly the same public wheel but only for the architectures we care about? Maybe it would be easier to use something closer to your CI.

The other mystery still remains. What made the public wheel double in size?
cu102/torch-1.7.0-cp37-cp37m-linux_x86_64.whl -> torch/lib/libtorch_cuda.so: 1.1GB
cu110/torch-1.7.0%2Bcu110-cp37-cp37m-linux_x86_64.whl -> torch/lib/libtorch_cuda.so: 2.2GB

I think the new archs were responsible for 300MB, but I don’t know about the other 700MB :confused:
cc @seemethere who knows the builder repo quite well. He will be able to help you with that!