Pytorch 1.11.0a0 with CUDA compile failure

fulltopic · December 23, 2021, 1:07pm

Build Pytorch from source.

 export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
USE_LMDB=ON python setup.py install

The build failed at very beginning (about 3/5xxx) when building protobuf related.
The whole shell exited quietly without visible output.

The output of collect_env.py:
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: Fedora Linux 35 (Workstation Edition) (x86_64)
GCC version: (GCC) 11.2.1 20211203 (Red Hat 11.2.1-7)
Clang version: Could not collect
CMake version: version 3.19.6
Libc version: glibc-2.34

Python version: 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.15.10-200.fc35.x86_64-x86_64-with-glibc2.34
Is CUDA available: N/A
CUDA runtime version: 11.5.119
GPU models and configuration: GPU 0: Quadro RTX 4000
Nvidia driver version: 495.29.05
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.2
[conda] blas 1.0 mkl
[conda] magma-cuda110 2.5.2 1 pytorch
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-include 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py39h7f8727e_0
[conda] mkl_fft 1.3.1 py39hd3c417c_0
[conda] mkl_random 1.2.2 py39h51133e4_0
[conda] numpy 1.21.2 py39h20f2e39_0
[conda] numpy-base 1.21.2 py39h79a1101_0
(pytorch) [zf@192 pytorch_1_11]$

Manually complete:
PyTorch version: 1.11.0a0
Is debug build: Y
CUDA used to build PyTorch: 11.5
ROCM used to build PyTorch: N/A

OS: Fedora Linux 35 (Workstation Edition) (x86_64)
GCC version: (GCC) 11.2.1 20211203 (Red Hat 11.2.1-7)
Clang version: Could not collect
CMake version: version 3.19.6
Libc version: glibc-2.34

Python version: 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.15.10-200.fc35.x86_64-x86_64-with-glibc2.34
Is CUDA available: Y
CUDA runtime version: 11.5.119
GPU models and configuration: GPU 0: Quadro RTX 4000
Nvidia driver version: 495.29.05
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.2
[conda] blas 1.0 mkl
[conda] magma-cuda110 2.5.2 1 pytorch
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-include 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py39h7f8727e_0
[conda] mkl_fft 1.3.1 py39hd3c417c_0
[conda] mkl_random 1.2.2 py39h51133e4_0
[conda] numpy 1.21.2 py39h20f2e39_0
[conda] numpy-base 1.21.2 py39h79a1101_0
(pytorch) [zf@192 pytorch_1_11]$

I’ve had nvcc installed but don’t know if it worked:
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

It may be caused by memory/cpu shortage:

Would you please help? Thank you.

ptrblck · December 23, 2021, 9:37pm

This might point to an out of memory issue, but I would expect to see a “Killed” message in your terminal.
Based on your screenshot also cudafe++ and cicc processes are used, so nvcc is compiling something, which wouldn’t fit into the crash at the protobuf stage.

In any case, try to rebuild it via MAX_JOBS=1 python setup.py install and see if this would work.

fulltopic · December 24, 2021, 3:27am

I’ve tried several times. Once the build terminated when building nccl related and the screenshot was printed at that time. It’s also at the very beginning. I continued the building without clean, then it failed at protobuf compile.

Anyway, it’s more like a memory shortage issue as the process kswapd0 is the top1 consumer of top command output.

fulltopic · December 27, 2021, 8:09am

MAX_JOBS works. Thank you.