Build error in PyTorch Vision

khushi-411 · October 29, 2021, 3:54am

Hi,
I’m excited to contribute to PyTorch Vision, so I want to build it on my local system.
I followed the CONTRIBUTING.md guide, but I’m getting the following error.

/home/khushi/anaconda3/lib/python3.8/site-packages/torch/include/c10/core/TensorImpl.h(2615): error: static assertion failed with "You changed the size of TensorImpl on 64-bit arch.See Note [TensorImpl size constraints] on how to proceed."

/home/khushi/Documents/vision/torchvision/csrc/ops/cuda/deform_conv2d_kernel.cu(231): warning: variable "device_guard" was declared but never referenced

/home/khushi/Documents/vision/torchvision/csrc/ops/cuda/deform_conv2d_kernel.cu(414): warning: variable "device_guard" was declared but never referenced

/home/khushi/Documents/vision/torchvision/csrc/ops/cuda/deform_conv2d_kernel.cu(658): warning: variable "device_guard" was declared but never referenced

/home/khushi/Documents/vision/torchvision/csrc/ops/cuda/deform_conv2d_kernel.cu(763): warning: variable "guard" was declared but never referenced

/home/khushi/Documents/vision/torchvision/csrc/ops/cuda/deform_conv2d_kernel.cu(925): warning: variable "guard" was declared but never referenced

/home/khushi/Documents/vision/torchvision/csrc/ops/cuda/deform_conv2d_kernel.cu(1057): warning: variable "guard" was declared but never referenced

1 error detected in the compilation of "/home/khushi/Documents/vision/torchvision/csrc/ops/cuda/deform_conv2d_kernel.cu".
error: command '/opt/cuda/bin/nvcc' failed with exit status 1

Could anyone please help me resolve this error?

Dependencies:

CUDA version: 11.4
GCC version: 11.1.0

Thanks

ptrblck · October 29, 2021, 11:08am

Are you trying to build directly from the master branch or is this already the branch with your changes?
In the latter case, check your git diff as you seem to have changed the size of TensorImpl:

/home/khushi/anaconda3/lib/python3.8/site-packages/torch/include/c10/core/TensorImpl.h(2615): error: static assertion failed with "You changed the size of TensorImpl on 64-bit arch.See Note [TensorImpl size constraints] on how to proceed."

In the former case, which commits are you using?

khushi-411 · October 29, 2021, 1:17pm

Hi @ptrblck!
Thanks for taking a look.

I tried both ways: via main and via the branch I created. I got the same error in both cases. I haven’t committed anything yet; I have only started setting up the environment.

Answers to the questions you asked:

The git diff command doesn’t output anything.
By commit, I’m assuming you are referring to the commands used for building. They are:

conda activate
conda install pytorch -c pytorch-nightly
git clone https://github.com/khushi-411/vision.git
cd vision
python setup.py develop

What do you think is happening?

my3bikaht · October 29, 2021, 2:23pm

Please check whether CUDA 11.4 supports GCC 11; it probably doesn’t.

khushi-411 · October 29, 2021, 2:55pm

Hi @my3bikaht!
Thanks for looking into it.

I did check the official page and some gists. According to those references, CUDA 11.4 supports GCC 11.
A few notable links worth mentioning are:

I would love to get your input! Thanks

ptrblck · October 29, 2021, 7:10pm

That’s a great suggestion, as I hadn’t noticed the GCC version.

@khushi-411 GCC 11.1, at least, hits a known bug with CUDA 11.4 (already fixed in CUDA 11.5), so you would need to either downgrade GCC or update CUDA.
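
For anyone checking their own setup, here is a minimal Python sketch, assuming nvcc and gcc are both on PATH. It only flags the single pairing reported in this thread (CUDA 11.4 with GCC 11.1) and is not a general compatibility table. The helper names (parse_nvcc, parse_gcc, is_flagged, tool_output) are made up for illustration.

```python
import re
import shutil
import subprocess


def parse_version(text, pattern):
    """Return (major, minor) from the first match of `pattern`, or None."""
    match = re.search(pattern, text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def parse_nvcc(text):
    # `nvcc --version` prints a line like
    # "Cuda compilation tools, release 11.4, V11.4.48".
    return parse_version(text, r"release (\d+)\.(\d+)")


def parse_gcc(text):
    # `gcc -dumpfullversion` (GCC 7 and newer) prints a bare version like "11.1.0".
    return parse_version(text, r"(\d+)\.(\d+)")


def is_flagged(cuda, gcc):
    # Assumption: only the pairing reported in this thread is flagged.
    return cuda == (11, 4) and gcc == (11, 1)


def tool_output(*cmd):
    """Run a tool and return its stdout, or None if it is not on PATH."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, capture_output=True, text=True).stdout


if __name__ == "__main__":
    nvcc_out = tool_output("nvcc", "--version")
    gcc_out = tool_output("gcc", "-dumpfullversion")
    cuda = parse_nvcc(nvcc_out) if nvcc_out else None
    gcc = parse_gcc(gcc_out) if gcc_out else None
    print("CUDA (nvcc):", cuda, "| GCC:", gcc)
    if cuda and gcc and is_flagged(cuda, gcc):
        print("This is the pairing reported as problematic in this thread.")
```

If the script reports the flagged pairing, the two options above apply: use an older GCC as the host compiler, or move to a newer CUDA toolkit.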

khushi-411 · October 31, 2021, 8:36pm

Hi @ptrblck, Hi @my3bikaht!
Thanks for the suggestions!

I tried to set it up both ways: by changing the CUDA version (CUDA 11.5 and CUDA 10) and by downgrading GCC to gcc-10.
The major problems I am facing:

The Arch Linux package does not seem to have an upstream link for CUDA 11.5 (I may have missed it; I might be wrong).
I then turned to CUDA 10 using sudo pacman -S cuda-10.0, which failed because the target was not available. I found another command, yay -S cuda-10.1, to install CUDA 10 on Arch Linux. This took more than 6 hours to build.
Then I planned to downgrade the GCC version (though I personally wanted to solve this through CUDA). I tried many things, but currently I am getting the following error:

/usr/bin/ld: eg: _ZSt3cin: invalid version 2 (max 0)
/usr/bin/ld: eg: error adding symbols: bad value
collect2: error: ld returned 1 exit status

SYSTEM CONFIGURATION

Manjaro Linux 21.0.0

Could you please give me some hints on how to resolve this error?
Thanks!

khushi-411 · November 2, 2021, 2:07pm

Hi @ptrblck, Hi @my3bikaht!

A gentle ping! Could you please look into the problem?
Thanks!

my3bikaht · November 2, 2021, 6:50pm

Sorry, I’m not using Arch Linux. I browsed just now, and Arch Linux has cuda 11.5.0-1 in its packages.
I also think you can install cudatoolkit, which has cuda as a dependency, either directly or using conda.
6 hours for yay is crazy; it seems like you were installing from source.

You can also download a specific CUDA version from here: Index of /archive/packages/c/cuda/ or here: Index of /packages/c/cuda/ and install it using 'sudo pacman -U filename'.

The error you mentioned is a generic compiler error; we won’t be able to find the reason this way.
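
For the GCC downgrade route discussed earlier in the thread, a minimal sketch of rerunning the same build with an older host compiler selected through the environment. It assumes the torchvision build honors the CC and CXX environment variables (the usual convention for setuptools-based extension builds) and that the older compiler is installed as gcc-10 and g++-10; both the executable names and the helper names build_env and build are assumptions for illustration.

```python
import os
import subprocess
import sys


def build_env(cc="gcc-10", cxx="g++-10", base=None):
    """Return a copy of the environment with CC/CXX set to the chosen compiler."""
    env = dict(os.environ if base is None else base)
    env["CC"] = cc    # assumed executable name of the older C compiler
    env["CXX"] = cxx  # assumed executable name of the older C++ compiler
    return env


def build(repo_dir, env):
    # Same command as used in the thread, run with the modified environment
    # and the interpreter that is running this script.
    return subprocess.run(
        [sys.executable, "setup.py", "develop"],
        cwd=repo_dir,
        env=env,
        check=True,
    )


# Example (not executed here): build("vision", build_env())
```

Removing the build directory left over from a previous attempt before rebuilding avoids mixing object files produced by two different compilers.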

khushi-411 · November 3, 2021, 10:46am

No problem, @my3bikaht.
I’ll try to resolve the error with other methods too. Thanks!