Building torch from source is failing

Hi,

I’m looking to contribute to PyTorch, and as part of this I’m setting up the development environment.

As a first step, I’m trying to build from source locally.

I have tried this in several environments:

  1. A fresh EC2 instance with an Ubuntu 22.04 image
  2. My personal workstation with an RTX 4090
  3. My personal laptop with an RTX 5000, using WSL and Docker

Each time I’m running into issues.

One recurring issue is that the build fails with this error:

[2309/2331] Linking CXX executable bin/torch_shm_manager
FAILED: bin/torch_shm_manager
: && /usr/bin/g++ -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_X86_SIMD_SORT -DXSS_USE_OPENMP -DUSE_PYTORCH_QNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow -DHAVE_AVX512_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -g -fno-omit-frame-pointer -O0 -L/opt/conda/lib/ -rdynamic -Wl,--no-as-needed caffe2/torch/lib/libshm/CMakeFiles/torch_shm_manager.dir/manager.cpp.o -o bin/torch_shm_manager -Wl,-rpath,/workspaces/pytorch/build/lib: lib/libshm.so -lrt lib/libc10.so -Wl,-rpath-link,/workspaces/pytorch/build/lib && /opt/conda/bin/cmake -E __run_co_compile --lwyu="ldd;-u;-r" --source=bin/torch_shm_manager && :
/usr/bin/ld: warning: libopenblas.so.0, needed by /workspaces/pytorch/build/lib/libtorch_cpu.so, not found (try using -rpath or -rpath-link)
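A quick sanity check for the missing library (the library path is taken from the warning above; adjust it to your checkout, and `ldconfig`/`ldd` are assumed to be available):

```shell
LIB=/workspaces/pytorch/build/lib/libtorch_cpu.so   # path from the warning above

# Show which of libtorch_cpu.so's dependencies cannot be resolved
if [ -f "$LIB" ]; then
  ldd "$LIB" | grep "not found"
fi

# Check whether the dynamic loader knows about OpenBLAS at all
ldconfig -p 2>/dev/null | grep -i openblas || echo "libopenblas.so.0 is not in the loader cache"
```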

I tried both the dev container approach and a local conda environment.

I also matched the CUDA version inside the dev container to the driver I had.

Any help is appreciated.

Here is the Google Drive link to the full log:

Thanks in advance,
Venkat

CUDA is unrelated to your error, which points to a missing OpenBLAS library.

Did you try installing any BLAS library, e.g. OpenBLAS, MKL etc.?
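For example, one typical way to get a BLAS library onto the setups you described (package names may differ on your distro or conda channel):

```shell
# Ubuntu 22.04 (EC2, WSL): system-wide OpenBLAS headers and shared library
sudo apt-get update
sudo apt-get install -y libopenblas-dev

# Or, when building inside a conda environment, install it there instead
conda install -y -c conda-forge libopenblas

# Alternatively, MKL also satisfies PyTorch's BLAS dependency
conda install -y mkl mkl-include
```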

A couple of things I checked:

  1. According to this blog post, it should run out of the box, and the only thing I needed to do was match the CUDA version.
  2. In .devcontainer/cuda/environment.yml, I see that libopenblas is listed as a dependency, so I assumed it would have been installed.

The main README.md has a section called Install Dependencies. I followed it when trying the conda environment, and right now I’m trying to install all those dependencies inside the dev container as well.

The only thing missing, judging from the log, is the MKL part. I will keep you posted.

Thanks for your reply and time.

Even after installing the dependencies, the build is still failing. It must be something other than the BLAS/MKL libraries.

I also read somewhere that these kinds of errors can be caused by too much build parallelism, so I tried setting MAX_JOBS to values from 2 to 16 and still see the same errors.
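For reference, I was setting it along these lines before rebuilding (4 is just one of the values I tried):

```shell
# MAX_JOBS caps the number of parallel compile jobs PyTorch's build spawns
export MAX_JOBS=4
python setup.py develop
```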

Thanks
Venkat

Also, this is not specific to the CUDA container; it fails the same way with the CPU container.

Since the failure points to torch_shm_manager, I also tried increasing the shared memory for the container, but to no avail!
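For completeness, the shared-memory bump was done along these lines (the image name is a placeholder and the size is just what I tried):

```shell
# --shm-size raises the size of /dev/shm inside the container
# (Docker's default is only 64 MB)
docker run --rm -it --shm-size=8g pytorch-dev-image bash
```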

@leimao do you have any thoughts on this?

  1. I have not been using PyTorch recently, so I don’t know whether the latest PyTorch can be installed successfully on my machine. (But it can definitely be built for at least some configurations; otherwise the PyTorch team would not be able to ship releases.)
  2. Your problem does not look like a build-parallelism problem to me because, as the error clearly indicates, some dependencies are missing.
  3. If the problem originates from the official PyTorch DevContainer, you should file a bug on the PyTorch GitHub.
  4. You might also want to try building my PyTorch clone from 9 months ago, using either the DevContainer or my custom container. I remember using that clone to build PyTorch from scratch for my two blog posts. If you can build that clone but cannot build the latest PyTorch from the official main, there is definitely something wrong in their build config that was not caught by the build tests.

Hello Lei,

I tried your custom container, and it builds fine there.

I can go one of two routes:

  1. Get started with your container and experiment from there.
  2. File a bug in PyTorch and, in parallel, figure out what’s wrong with the devcontainer in the master repo.

Thanks for pitching in.