Performance degradation when building from source

Hi, I’m seeing performance degradation with PyTorch compiled from source, and I’m wondering if there are some details I’m missing.

Using PyTorch v2.0.0 as an example, I used CUDA 11.7 and cuDNN 8.5 (consistent with builder/install_cuda.sh in pytorch/builder), and carefully read the GitHub Actions log of the nightly build for the v2.0.0 commit (pytorch/pytorch@c263bd4, “[inductor] use triu ref instead of lowering (#96040) (#96462)”).

My final build command is:

export USE_NINJA=OFF
export TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
export USE_STATIC_NCCL=1
export TORCH_CUDA_ARCH_LIST='8.0'
export INSTALL_TEST=0
export USE_FBGEMM=1

export CUDA_HOME=/usr/local/cuda-11.7/
export CUDACXX=/usr/local/cuda-11.7/bin/nvcc
export LD_LIBRARY_PATH=/usr/local/cuda-11.7/lib64/:$LD_LIBRARY_PATH
export CUDNN_LIBRARY_PATH=/usr/local/cuda-11.7/lib64/

CMAKE_PREFIX_PATH="$(dirname $(which conda))/../"  python setup.py install

I found that the PyTorch I compiled myself performs differently from the PyTorch obtained via conda install: using the MNIST example on an A100, the forward time increased by about 5%.

What commit did you build from source, and were you able to verify that, e.g., all library versions (cuDNN, cuBLAS, etc.) were the same between your source build and the upstream wheels?
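
For the version check, running something along these lines in both environments should be enough to compare them side by side (just a sketch):

import torch

# Print the versions this install was built against, so the source build and
# the prebuilt binaries can be compared directly.
print("torch      :", torch.__version__)
print("git commit :", torch.version.git_version)
print("CUDA       :", torch.version.cuda)
print("cuDNN      :", torch.backends.cudnn.version())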

Would you also be able to share how you are benchmarking MNIST?

Thanks for the quick reply!

The specific commit is c263bd4 (the v2.0.0 tag); torch.__version__ returns 2.0.0a0+gitc263bd4.

I could confirm that cuDNN is the same (cuDNN v8.5.0, installed according to the script in pytorch/builder; both installs return 8500 for torch.backends.cudnn.version()). The other CUDA libraries (e.g., cuBLAS) should be the ones shipped with CUDA 11.7.0. In my case NCCL is built automatically by setup.py because USE_STATIC_NCCL=1 is set, though that is probably not relevant here.
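
In case it is useful, I can also check which CUDA shared libraries each install actually loads at runtime with a small Linux-only sketch like this (the conv/matmul calls are only there to force cuDNN and cuBLAS to load):

import os
import torch
import torch.nn.functional as F

# Trigger cuDNN and cuBLAS so their shared objects get loaded.
x = torch.randn(1, 3, 32, 32, device="cuda")
F.conv2d(x, torch.randn(8, 3, 3, 3, device="cuda"))
torch.mm(torch.randn(4, 4, device="cuda"), torch.randn(4, 4, device="cuda"))

# List the CUDA-related libraries mapped into this process (Linux only).
with open(f"/proc/{os.getpid()}/maps") as maps:
    paths = {line.split()[-1] for line in maps if "/" in line}
for p in sorted(paths):
    if any(name in p for name in ("cudnn", "cublas", "cudart", "nccl")):
        print(p)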

The MNIST example I used is a script I posted on Pastebin. It uses random data as input and a CUDA event-based timer to record time, and it prints the average per-step forward time after each epoch.
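
Roughly, the timing part looks like this (a simplified sketch with a placeholder model, not the exact Pastebin script):

import torch
import torch.nn as nn

# Random input data and a CUDA-event timer around the forward pass.
model = nn.Sequential(nn.Conv2d(1, 32, 3), nn.ReLU(), nn.Flatten(),
                      nn.Linear(32 * 26 * 26, 10)).cuda()
x = torch.randn(64, 1, 28, 28, device="cuda")

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
times = []
for step in range(100):
    start.record()
    out = model(x)
    end.record()
    torch.cuda.synchronize()          # wait for the forward pass to finish
    if step >= 10:                    # skip warm-up iterations
        times.append(start.elapsed_time(end))  # milliseconds
print(f"avg forward time: {sum(times) / len(times):.3f} ms")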

Interesting. Does profiling with, e.g., nsys nvprof show the same kernels for the source build vs. the prebuilt binaries? If so, then we can basically rule out differences due to math libraries.
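
If exporting an nsys trace is inconvenient, torch.profiler can also list the launched CUDA kernel names from Python; a minimal sketch (placeholder model) would be:

import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# List the CUDA kernels launched during one forward pass, so the kernel names
# can be compared between the source build and the conda install.
model = nn.Sequential(nn.Conv2d(1, 32, 3), nn.ReLU()).cuda()
x = torch.randn(64, 1, 28, 28, device="cuda")
model(x)  # warm-up so one-time initialization doesn't clutter the trace

with profile(activities=[ProfilerActivity.CUDA]) as prof:
    model(x)
    torch.cuda.synchronize()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))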

Thanks for the hint.

After inspecting the kernels launched, as reported by nsys, I found that both builds launch exactly the same kernels; the traces are attached below.

Compiled from source:

Installed from conda:

Another interesting thing is that the cold-start overhead differs greatly between the compiled and the installed versions, and this can be reproduced stably. Therefore, I believe there must be some difference between them.

Are the differences present only on cold start and with just a single layer? I’m curious whether it could be caused by, e.g., lazy module loading (see the CUDA C++ Programming Guide), which was introduced in CUDA 11.7.
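
A quick way to check whether lazy loading is even active is to look at the CUDA_MODULE_LOADING environment variable; for example:

import os
import torch

# Lazy module loading (introduced in CUDA 11.7) is controlled by the
# CUDA_MODULE_LOADING environment variable and only takes effect with an
# R515+ driver, so this just shows whether it could be in play at all.
print("CUDA_MODULE_LOADING =", os.environ.get("CUDA_MODULE_LOADING", "<unset>"))
print("CUDA used to build torch:", torch.version.cuda)
# The installed driver version can be checked with `nvidia-smi` outside Python.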

Thanks for the explanation. I suppose I am not using lazy loading, since I only upgraded CUDA to 11.7 and do not have an R515+ driver.

Setting the cold-start difference aside, my primary concern is: are there any differences (e.g., compile flags) between building PyTorch from source and the prebuilt binaries (e.g., from conda)? If I understand the scripts in pytorch/builder correctly, the scripts and environment variables above should exactly match the official build procedure and thus produce the same artifact (apart from the version string). Or is there something I have missed?
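
One way to compare the two builds directly is to diff the configuration reported by torch.__config__ on both installs, e.g. with a small helper like this (a sketch; the file names are hypothetical):

import sys
import difflib
import torch

# Usage (hypothetical file names):
#   python compare_config.py dump > source_build.txt   # in the source-build env
#   python compare_config.py dump > conda_build.txt    # in the conda env
#   python compare_config.py diff source_build.txt conda_build.txt
if sys.argv[1] == "dump":
    # Full build configuration: compiler, CUDA/cuDNN versions, CMake options, etc.
    print(torch.__config__.show())
    print(torch.__config__.parallel_info())
else:
    with open(sys.argv[2]) as a, open(sys.argv[3]) as b:
        sys.stdout.writelines(difflib.unified_diff(
            a.readlines(), b.readlines(),
            fromfile=sys.argv[2], tofile=sys.argv[3]))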