Linking errors when using nightly prebuilt binaries

IgnacioPickering · February 10, 2021, 12:32am

I’m currently using the prebuilt binary of the PyTorch nightly, built with CUDA 11.0, which I installed with pip following the instructions at Start Locally | PyTorch.

I have a custom model that I can normally run on my GPU (Quadro RTX 600) without any issues. (My CUDA driver is 460.32.03.)

My issue is that if I try to evaluate my model after jit-scripting it, I get a linker error:

    result = self.forward(*input, **kwargs)
../../../anaconda3/envs/myenv/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py:132: in prof_meth_call
    return prof_callable(meth_call, *args, **kwargs)
../../../anaconda3/envs/myenv/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py:126: in prof_callable
    return callable(*args, **kwargs)
E   RuntimeError: Error in dlopen or dlsym: libnvrtc.so.11.0: cannot open shared object file: No such file or directory

This is very strange to me, since I believed that the prebuilt binaries ship with all the necessary CUDA libraries already linked, so PyTorch would neither use nor need my system’s CUDA installation. Am I mistaken about this?
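To make the question concrete, here is a minimal sketch of how one might check whether the dynamic loader can find the library named in the error, and whether the installed wheel carries its own copy. It uses only the standard library. The helper names `can_dlopen` and `bundled_libraries` are hypothetical, and the idea that the wheel keeps its shared libraries in a `lib` directory inside the `torch` package is an assumption, not something confirmed in this thread.

```python
import ctypes
import glob
import importlib.util
import os


def can_dlopen(library_name: str) -> bool:
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        # Raised when the loader cannot locate or load the file.
        return False
    return True


def bundled_libraries(package: str = "torch", pattern: str = "libnvrtc*") -> list:
    """List files matching `pattern` in the package's `lib` directory.

    Assumption: the wheel stores its shared libraries under <package>/lib.
    The package is located without importing it; an empty list is returned
    when the package is not installed or nothing matches.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return []
    matches = []
    for root in spec.submodule_search_locations:
        matches.extend(glob.glob(os.path.join(root, "lib", pattern)))
    return sorted(matches)


if __name__ == "__main__":
    # The library name is taken verbatim from the error message above.
    print("loader finds libnvrtc.so.11.0:", can_dlopen("libnvrtc.so.11.0"))
    print("files matching libnvrtc* in the package:", bundled_libraries())
```

If the loader check fails while a differently named file shows up in the package directory, that would hint at a mismatch between the name being requested and the name that was shipped, though this sketch cannot confirm the cause on its own.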

Furthermore, I actually do have CUDA 11.2 installed on my system. For reference, the output of nvcc --version is:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Nov_30_19:08:53_PST_2020
Cuda compilation tools, release 11.2, V11.2.67
Build cuda_11.2.r11.2/compiler.29373293_0

Also, this error doesn’t appear at all if I use PyTorch 1.7.1.

Is it possible that there is actually something wrong with my CUDA installation or driver, or is this an internal PyTorch error?

Other potentially useful info: the linking error appears if I use the default fuser. If I run this using “fuser2” instead (nvFuser, I believe), then I get:
E RuntimeError: false INTERNAL ASSERT FAILED at "/pytorch/torch/csrc/jit/codegen/cuda/type.cpp":439, please report a bug to PyTorch. No data type found for scalar type.

This is frustrating since my custom model is very large and I’m unable to pinpoint exactly which part is causing this, so I think it is very unlikely that I will be able to post a minimal working example.
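One way to narrow this down, sketched here under assumptions, is to script and run each submodule on its own and record which ones raise. The helpers `find_failing_parts` and `script_and_run` are hypothetical names, and an example input for each submodule would have to be supplied by hand. The sketch also assumes that `torch.jit.fuser` accepts the fuser name mentioned above, and it calls each scripted module a few times because fusion may only kick in after some profiled calls.

```python
import contextlib
from typing import Any, Callable, Dict, Iterable, Optional, Tuple


def find_failing_parts(
    parts: Iterable[Tuple[str, Any]],
    check: Callable[[Any], None],
) -> Dict[str, str]:
    """Run `check` on every named part and collect the errors it raises."""
    failures = {}
    for name, part in parts:
        try:
            check(part)
        except Exception as exc:
            # Keep going so that every failing part is reported, not just the first.
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


def script_and_run(
    module_and_args: Tuple[Any, tuple],
    fuser_name: Optional[str] = None,
    calls: int = 3,
) -> None:
    """Script one module and call it a few times on its example arguments.

    `fuser_name=None` leaves the default fuser in place; passing "fuser2"
    selects the alternative fuser mentioned in the post.
    """
    import torch  # imported lazily so the generic helper works without PyTorch

    module, example_args = module_and_args
    scripted = torch.jit.script(module)
    fuser = torch.jit.fuser(fuser_name) if fuser_name else contextlib.nullcontext()
    with fuser:
        for _ in range(calls):
            scripted(*example_args)


# Hypothetical usage, assuming `model` is the unscripted nn.Module and
# `inputs_by_name` maps each direct submodule name to a tuple of example inputs:
#
#   parts = [(name, (sub, inputs_by_name[name]))
#            for name, sub in model.named_children()]
#   print(find_failing_parts(parts, script_and_run))
```

Repeating the same scan one level deeper on whichever submodule fails could shrink the reproducer step by step, although an error that only appears when operations from several submodules are fused together would not be caught this way.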

ptrblck · February 11, 2021, 6:54am

Thanks for reporting it. Issue is tracked here.