Wrong results on CUDA

Hi, I compiled PyTorch 1.13.1 with CUDA 11.6. When I ran my code on the machine where PyTorch was compiled, it worked properly, but when I moved the compiled PyTorch to another machine (where torch.cuda.is_available() was confirmed to be True), I got wrong results on CUDA, as shown in the following figure:

I used ldd and patchelf to make sure all the dependencies were bundled and migrated to the new machine. Is there anything that could be causing this problem? I have some knowledge of the PyTorch source code and just need some ideas to figure this out.

What’s the difference between the two machines? Are they using the same GPUs or from the same GPU family?

Thanks for your reply and the reminder. PyTorch was compiled in a Docker container running Ubuntu 20.04 with CUDA toolkit 11.6, and it works properly (the results are correct) on an up-to-date Arch Linux host (the Docker container also runs on it) with two RTX 8000 GPUs and NVIDIA driver 535.113.01.
The other machine, which gives wrong results, runs Ubuntu 22.04 with an A100 80GB GPU and NVIDIA driver 535.154.05.
It's a weird problem.

I'm also trying to switch to the official v1.13.1 binary libraries downloaded from pytorch.org, because I noticed that on the "other machine" the official PyTorch 1.13.1 (installed via conda) gives correct results. But both the pre-cxx11-abi and the cxx11-abi versions cause tons of linking errors (with both g++ and clang++), while the Windows version works fine with the MSVC toolchain.

All operations return 0, not only torch.ones; add does as well.
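
To illustrate, something like this is what I'm seeing (a minimal sketch of the symptom; the exact values match what I described, not a verbatim copy of the screenshot):

```python
import torch

# On the broken install, every CUDA operation silently returns zeros.
x = torch.ones(3, device="cuda")
print(x)      # expected tensor([1., 1., 1.], device='cuda:0'), but prints all zeros
print(x + x)  # add shows the same behavior: all zeros instead of tensor([2., 2., 2.])
```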

I don’t know if you are building PyTorch from source, but it seems installing the binaries works:

so I would assume your build creates binaries which are somehow broken. Did you check that the expected compute capabilities are supported?
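
E.g. something like this quick check would show it:

```python
import torch

print(torch.cuda.get_device_capability(0))  # compute capability of the device, e.g. (8, 0)
print(torch.cuda.get_arch_list())           # architectures this build ships code for
```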

Well, I embedded a Python interpreter into my application and make use of both libtorch and PyTorch. The official 1.13.1 cxx11-abi libtorch dynamic libraries are not compatible with PyTorch 1.13.1's _C.cpython-310-x86_64-linux-gnu.so (because the official PyTorch build uses the pre-cxx11 ABI), so I have to compile PyTorch myself.

Yes, I also think some binaries are somehow broken, but there are so many libraries that I can't figure out which ones are… I remember that my PyTorch 1.13.1 was compiled for a compute capability lower than 8.0 (the A100 GPU is compute capability 8.0).

This problem is driving me crazy.

Compiling PyTorch yourself is truly a terrible experience. Initially I attempted to compile PyTorch directly on my Linux system, which led to a multitude of compilation errors, so I had no choice but to use the pytorch/pytorch Docker image for compilation. I thought this would resolve the issues, but later discovered that the computation results were incorrect when the build was transferred to another machine.

On Windows, I directly used the official PyTorch release, which worked very well. On Linux, however, I was forced to compile PyTorch myself because the official release's _C.cpython-310-x86_64-linux-gnu.so is not compatible with the cxx11 ABI: the cxx11-abi version of libtorch provided on the PyTorch website cannot be used together with _C.cpython-310-x86_64-linux-gnu.so, and my program requires the C++11 ABI. I don't understand why, in the year 2024, the official PyTorch release still does not use the C++11 ABI (torch._C._GLIBCXX_USE_CXX11_ABI shows False).
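
For what it's worth, the ABI of a given build can be checked at runtime (a small sketch using the public torch API):

```python
import torch

# False for the official pre-cxx11 wheels, True for a cxx11-abi source build.
print(torch.compiled_with_cxx11_abi())
print(torch._C._GLIBCXX_USE_CXX11_ABI)  # the same flag, via the private module
```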

Now I'm trying to use Docker to package my software together with the environment of the machine that gives correct results. Hoping this solves the problem :rofl:

Yes, indeed, source builds might not be trivial in custom environments, but they do work directly in e.g. standard Ubuntu installations. Good luck with the Docker approach!

Hi, I tried Docker, but the issue still occurred. I noticed that running

torch.cuda.get_arch_list()

on my compiled PyTorch gives only ['sm_75'].

That "other machine" has an A100 80GB GPU (compute capability 8.0).

And when I install PyTorch via conda on this same machine, it works properly.

Both installations are on the same machine. So maybe it is because my compiled PyTorch was not built for sm_80?

But I remember that compute capability 8.0 devices should be able to run 7.5 programs. Why doesn't that work here? :pleading_face:

Forgot to click “Reply”, sorry to reply again :slight_smile:

That's not the case, and you will need to compile PyTorch for sm_80 to be able to run on your Ampere GPU. Compiled CUDA binaries are only compatible within the same major compute capability: minor versions can be compatible, e.g. an sm_89 device can run code built for sm_86 and sm_80, but an sm_80 device cannot run sm_75 binaries (unless PTX was embedded, which can be JIT-compiled for the newer device).
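
Something like this sketch could fail fast on such a mismatch (the check_gpu_arch helper is just an example, not a built-in):

```python
import torch

def check_gpu_arch(device: int = 0) -> None:
    """Fail fast if this PyTorch build ships no code usable on the current GPU."""
    major, minor = torch.cuda.get_device_capability(device)
    archs = torch.cuda.get_arch_list()  # e.g. ['sm_75'] for the broken build
    usable = [
        a for a in archs
        # SASS is only compatible within one major version (lower minor runs on higher):
        if (a.startswith(f"sm_{major}") and a <= f"sm_{major}{minor}")
        # embedded PTX can still be JIT-compiled for newer devices:
        or a.startswith("compute_")
    ]
    if not usable:
        raise RuntimeError(
            f"this build supports {archs}, but GPU {device} is sm_{major}{minor}; "
            f"rebuild with TORCH_CUDA_ARCH_LIST including {major}.{minor}"
        )

check_gpu_arch()
```

Exporting e.g. TORCH_CUDA_ARCH_LIST="7.5;8.0" (or "8.0+PTX") before running the source build should then create binaries that work on both machines.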

Yes, thanks. I'm sure that is the cause now. Our software contains some other CUDA kernels which were compiled for sm_80; I just tested them and they work fine on the A100 machine.