Wrong results on CUDA

Hi, I compiled PyTorch 1.13.1 with CUDA 11.6. When I ran my code on the machine where PyTorch was compiled, it worked properly, but when I moved the compiled PyTorch to another machine (where torch.cuda.is_available() was confirmed to be True), I got wrong results on CUDA, as shown in the following figure:

I used ldd and patchelf to make sure all the dependencies were bundled and migrated to the new machine. Is there anything that could be causing this problem? I have some knowledge of the PyTorch source code and just need some ideas to figure this out.

What’s the difference between the two machines? Are they using the same GPUs or from the same GPU family?

Thanks for your reply and the reminder. PyTorch was compiled in a Docker container running Ubuntu 20.04 with CUDA toolkit 11.6, and it works properly (the results are correct) on an up-to-date Arch Linux host (the Docker container also runs on it) with two RTX 8000 GPUs and NVIDIA driver 535.113.01.
The other machine, which gives wrong results, runs Ubuntu 22.04 with an A100 80GB GPU and NVIDIA driver 535.154.05.
It's a weird problem.

I'm also trying to switch to the official v1.13.1 binary libraries downloaded from pytorch.org, because I noticed that on the "other machine" the official PyTorch 1.13.1 (installed via conda) gives correct results. But both the pre-cxx11-abi and the cxx11-abi versions cause tons of linking errors (with both g++ and clang++), while the Windows version works fine with the MSVC toolchain.

All operations return 0, not only torch.ones; add does as well.
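
To illustrate, something like this is what I'm seeing (a minimal sketch of the symptom; the exact values match what I described, not a verbatim copy of the screenshot):

```python
import torch

# On the broken install, every CUDA operation silently returns zeros.
x = torch.ones(3, device="cuda")
print(x)      # expected tensor([1., 1., 1.], device='cuda:0'), but prints all zeros
print(x + x)  # add shows the same behavior: all zeros instead of tensor([2., 2., 2.])
```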

I don’t know if you are building PyTorch from source, but it seems installing the binaries works:

so I would assume your build creates binaries which are somehow broken. Did you check that the expected compute capabilities are supported?
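
E.g. something like this quick check would show it:

```python
import torch

print(torch.cuda.get_device_capability(0))  # compute capability of the device, e.g. (8, 0)
print(torch.cuda.get_arch_list())           # architectures this build ships code for
```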

Well, I embedded a Python interpreter into my application and make use of both libtorch and PyTorch. The official 1.13.1 cxx11-abi libtorch dynamic libraries are not compatible with PyTorch 1.13.1's _C.cpython-310-x86_64-linux-gnu.so (because the official PyTorch build uses the pre-cxx11 ABI), so I have to compile PyTorch myself.

Yes, I also think some binaries are somehow broken, but there are so many libraries that I can't figure out which ones are… I remember that my PyTorch 1.13.1 was compiled for a compute capability lower than 8.0 (the A100 GPU is compute capability 8.0).

This problem is driving me crazy.

Compiling PyTorch yourself is truly a terrible experience. Initially I attempted to compile PyTorch directly on my Linux system, which led to a multitude of compilation errors, so I had no choice but to use the pytorch/pytorch Docker image for compilation. I thought this would resolve the issues, but later discovered that the computation results were incorrect when the build was transferred to another machine.

On Windows, I directly used the official PyTorch release, which worked very well. On Linux, however, I was forced to compile PyTorch myself because the official release's _C.cpython-310-x86_64-linux-gnu.so is not compatible with the cxx11 ABI: the cxx11-abi version of libtorch provided on the PyTorch website cannot be used together with _C.cpython-310-x86_64-linux-gnu.so, and my program requires the C++11 ABI. I don't understand why, in the year 2024, the official PyTorch release still does not use the C++11 ABI (torch._C._GLIBCXX_USE_CXX11_ABI shows False).
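
For what it's worth, the ABI of a given build can be checked at runtime (a small sketch using the public torch API):

```python
import torch

# False for the official pre-cxx11 wheels, True for a cxx11-abi source build.
print(torch.compiled_with_cxx11_abi())
print(torch._C._GLIBCXX_USE_CXX11_ABI)  # the same flag, via the private module
```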

Now I'm trying to use Docker to package my software together with the environment of the machine that gives correct results. Hoping this solves the problem :rofl:

Yes, indeed, source builds might not be trivial in custom environments, but they do work directly in e.g. standard Ubuntu installations. Good luck with the Docker approach!

Hi, I tried Docker, but the issue still occurred. I noticed that running

torch.cuda.get_arch_list()

on my compiled PyTorch gives only ['sm_75'].

That "other machine" has an A100 80GB GPU (compute capability 8.0).

And when I install PyTorch via conda on this same machine, it works properly.

Both installations are on the same machine. So maybe it is because my compiled PyTorch was not built for sm_80?

But I remember that compute capability 8.0 devices should be able to run 7.5 programs. Why doesn't that work here? :pleading_face:

Forgot to click “Reply”, sorry to reply again :slight_smile:

That's not the case, and you will need to compile PyTorch for sm_80 to be able to run on your Ampere GPU. Compiled CUDA binaries are only compatible within the same major compute capability: minor versions can be compatible, e.g. an sm_89 device can run code built for sm_86 and sm_80, but an sm_80 device cannot run sm_75 binaries (unless PTX was embedded, which can be JIT-compiled for the newer device).
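
Something like this sketch could fail fast on such a mismatch (the check_gpu_arch helper is just an example, not a built-in):

```python
import torch

def check_gpu_arch(device: int = 0) -> None:
    """Fail fast if this PyTorch build ships no code usable on the current GPU."""
    major, minor = torch.cuda.get_device_capability(device)
    archs = torch.cuda.get_arch_list()  # e.g. ['sm_75'] for the broken build
    usable = [
        a for a in archs
        # SASS is only compatible within one major version (lower minor runs on higher):
        if (a.startswith(f"sm_{major}") and a <= f"sm_{major}{minor}")
        # embedded PTX can still be JIT-compiled for newer devices:
        or a.startswith("compute_")
    ]
    if not usable:
        raise RuntimeError(
            f"this build supports {archs}, but GPU {device} is sm_{major}{minor}; "
            f"rebuild with TORCH_CUDA_ARCH_LIST including {major}.{minor}"
        )

check_gpu_arch()
```

Exporting e.g. TORCH_CUDA_ARCH_LIST="7.5;8.0" (or "8.0+PTX") before running the source build should then create binaries that work on both machines.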

Yes, thanks. I'm sure that is the cause now. Our software contains some other CUDA kernels which were compiled for sm_80; I just tested them and they work fine on the A100 machine.