MKL error pytorch on Azure ubuntu VM

I am having a problem after installing the latest pytorch into an Azure Data Science Virtual Machine, via the fast.ai conda environment.

I have CUDA 9 installed, and after following a thread linked from Issue #5099 I tried upgrading MKL, but all it offered me was an upgrade from 2018.0.2-1 to 2018.0.2-intel_1, so it doesn’t seem out of date.

I believe I have the latest version of pytorch, as I’ve just installed it using the conda install process with -c soumith. Unfortunately I cannot verify my torch version, because importing torch fails with an undefined symbol, mkl_lapack_ao_ssyrdb, in /opt/intel/mkl/lib/intel64/libmkl_gf_lp64.so (see the trace below).

Following the Issue #5099 thread I also executed echo $LD_LIBRARY_PATH, and got this:
/opt/intel/tbb/lib/intel64/gcc4.7:/opt/intel/lib/intel64:/opt/intel/mkl/lib/intel64:/dsvm/tools/cntk/cntk/lib:/dsvm/tools/cntk/cntk/dependencies/lib::/usr/local/cuda/lib64
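As a side note on that output: ld.so consults LD_LIBRARY_PATH left to right, so directory order decides which copy of a library wins. A small sketch that just splits the value above to make the precedence visible (the empty entry between the last two paths means "current directory" to the loader):

```shell
# The exact LD_LIBRARY_PATH value reported above, one directory per line.
# Note the system MKL dir is third, and no conda env lib dir appears at all.
LDP="/opt/intel/tbb/lib/intel64/gcc4.7:/opt/intel/lib/intel64:/opt/intel/mkl/lib/intel64:/dsvm/tools/cntk/cntk/lib:/dsvm/tools/cntk/cntk/dependencies/lib::/usr/local/cuda/lib64"
echo "$LDP" | tr ':' '\n'
```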

Any idea about how to fix this?

```
import torch

ImportError                               Traceback (most recent call last)
in <module>()
----> 1 import torch

~/.conda/envs/fastai/lib/python3.6/site-packages/torch/__init__.py in <module>()
     54 except ImportError:
     55     pass
---> 56 from torch._C import *
     57
     58 __all__ += [name for name in dir(_C)

ImportError: /opt/intel/mkl/lib/intel64/libmkl_gf_lp64.so: undefined symbol: mkl_lapack_ao_ssyrdb
```

Is the symbol actually in the .so? Can you try nm -an libmkl_gf_lp64.so | grep mkl_lapack_ao_ssyrdb

Thanks Simon, for getting involved!

I tried exactly what you said, but the libmkl_gf_lp64 was not found.

So I tried specifying the full path to the one that PyTorch errored on, and the symbol came back flagged U:

nm -an  /opt/intel/mkl/lib/intel64/libmkl_gf_lp64.so | grep mkl_lapack_ao_ssyrdb
U mkl_lapack_ao_ssyrdb

I tried shortening the string being grepped for, so I could see what symbols are listed:

Chris_Palmer@FASTAI:~$ nm -an  /opt/intel/mkl/lib/intel64/libmkl_gf_lp64.so | grep mkl_lapack_ao_
                 U mkl_lapack_ao_cgeqrf
                 U mkl_lapack_ao_cgetrf
                 U mkl_lapack_ao_cgetrfnpi
                 U mkl_lapack_ao_cpotrf
                 U mkl_lapack_ao_cspffrt2
                 U mkl_lapack_ao_cspffrtx
                 U mkl_lapack_ao_dgeqrf
                 U mkl_lapack_ao_dgetrf
                 U mkl_lapack_ao_dgetrfnpi
                 U mkl_lapack_ao_dpotrf
                 U mkl_lapack_ao_dspffrt2
                 U mkl_lapack_ao_dspffrtx
                 U mkl_lapack_ao_dsyrdb
                 U mkl_lapack_ao_inquiry
                 U mkl_lapack_ao_sgeqrf
                 U mkl_lapack_ao_sgetrf
                 U mkl_lapack_ao_sgetrfnpi
                 U mkl_lapack_ao_spotrf
                 U mkl_lapack_ao_sspffrt2
                 U mkl_lapack_ao_sspffrtx
                 U mkl_lapack_ao_ssyrdb
                 U mkl_lapack_ao_zgeqrf
                 U mkl_lapack_ao_zgetrf
                 U mkl_lapack_ao_zgetrfnpi
                 U mkl_lapack_ao_zpotrf
                 U mkl_lapack_ao_zspffrt2
                 U mkl_lapack_ao_zspffrtx

I looked for all instances of libmkl_gf_lp64.so, and found many candidates.

When I execute nm -an against any of these I get no response back… is that bad?

On the most likely candidate I tried again with just the first part of the name, to get near it:

nm -an  /home/Chris_Palmer/.conda/envs/fastai/lib/libmkl_gf_lp64.so | grep mkl_lapack_ao_
U mkl_lapack_ao_inquiry

If any of this seems OK, what’s the best way to tell the system to use the correct one?

/home/Chris_Palmer/.conda/envs/fastai/lib/libmkl_gf_lp64.so
/data/home/Chris_Palmer/.conda/envs/fastai/lib/libmkl_gf_lp64.so
/anaconda/lib/libmkl_gf_lp64.so
/anaconda/pkgs/mkl-2018.0.0-hb491cac_4/lib/libmkl_gf_lp64.so
/anaconda/pkgs/mkl-2018.0.2-intel_1/lib/libmkl_gf_lp64.so
/anaconda/pkgs/mkl-2018.0.2-1/lib/libmkl_gf_lp64.so

Hmm, this seems like another instance of https://github.com/pytorch/pytorch/issues/6131. Can you remove either your system mkl or the conda mkl?

Following that link and others it linked to, there seems to be quite a bit of discussion about these sorts of issues, most of which relate to building Pytorch in situ, MKL, and NNPACK. Do you think it’s possible that a new version of Pytorch might address these issues?

Regarding your advice, perhaps I’ll try removing the conda MKL - the system one seems to be locked down (I wasn’t permitted to upgrade the system MKL).

Do you know how to determine the MKL version in each directory? (I’ve searched but have found no clear instructions - I only found my conda versions by running the install again, when it reported the versions to me.)

Can you please confirm what exactly I should do? Should I just conda uninstall my conda installations?

Is there no PATH setting I can use to force this to see the right information?

Thanks

I was not saying that it will fix your problem; I was saying that it’s worth a try to remove one install. So maybe conda uninstall works, maybe it doesn’t.

I’m not sure if there is any env var setting for this.

Btw, the latest pytorch version is at -c pytorch, and you can check the installed version by looking directly at torch/version.py
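A sketch of that check, since `import torch` is exactly what fails here. On the real VM the file would live under the env path quoted earlier in this thread; simulated below with a stand-in file and a made-up version string, because the point is only where version.py lives and what to grep:

```shell
# Real target would be something like:
#   ~/.conda/envs/fastai/lib/python3.6/site-packages/torch/version.py
# Stand-in so the command can be shown end to end:
mkdir -p /tmp/site-packages/torch
printf "__version__ = '0.3.1.post2'\n" > /tmp/site-packages/torch/version.py
grep __version__ /tmp/site-packages/torch/version.py
```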

Sure, I will try it.

Actually I am unsure what my error means by the reference being undefined. Is it because something is missing from my system, or is there a conflict in determining where to find it? Does “undefined symbol: mkl_lapack_ao_ssyrdb” mean that mkl_lapack_ao_ssyrdb should be findable somewhere but cannot be? Where would it normally be found? Is it referenced because Pytorch requires it, or just because it’s mentioned in libmkl_gf_lp64.so?

It looks like there’s a backward-compatibility issue with your MKL version, mkl-2018.0.2, in your output.
You can try downgrading it to the previous version: conda install mkl=2018.0.1
Then restart the notebook.

Actually, I already had 2018.0.1 when the error occurred - the 2018.0.2 came in when I reinstalled MKL (using conda) in an attempt to resolve the problem.

But I believe that the problem I have is that it is going to the system MKL
/opt/intel/mkl/lib/intel64/libmkl_gf_lp64.so

rather than the one that’s compatible with my anaconda installation
/home/Chris_Palmer/.conda/envs/fastai/lib/libmkl_gf_lp64.so

which I “upgraded” to 2018.0.2 after encountering the error, but which I can revert to 2018.0.1.

How can I determine the version of the library in /opt/intel/mkl/lib/intel64? I think it’s not possible for me to alter it, but it would be nice to know what I am dealing with!
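One possible way to answer that, as a sketch assuming a standard /opt/intel layout (both checks are read-only, so no permissions are needed): the MKL header ships version macros, and libmkl_rt exports mkl_get_version_string. Guarded so it degrades gracefully on machines without MKL:

```shell
# 1) Version macros in the header, if present:
H=/opt/intel/mkl/include/mkl_version.h
if [ -f "$H" ]; then
  grep -m1 '__INTEL_MKL' "$H"     # e.g. #define __INTEL_MKL__ 2018
else
  echo "no MKL header at $H"
fi

# 2) Ask the runtime library itself, if present:
RT=/opt/intel/mkl/lib/intel64/libmkl_rt.so
if [ -f "$RT" ]; then
  python - "$RT" <<'EOF'
import ctypes, sys
mkl = ctypes.CDLL(sys.argv[1])
buf = ctypes.create_string_buffer(256)
mkl.mkl_get_version_string(buf, 256)  # MKL support-function entry point
print(buf.value.decode())
EOF
else
  echo "no MKL runtime at $RT"
fi
```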

How can I force my system to use the correct one?
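On the “force it” question: the loader simply takes the first directory on its search path that contains the file, so prepending the conda env’s lib directory to LD_LIBRARY_PATH is the usual lever on Linux (whether that is advisable on a DSVM with other tools depending on the path is a separate question). A simulation of that first-match rule with throwaway directories standing in for the real paths listed earlier:

```shell
# Two fake copies of the library, "system" and "conda".
mkdir -p /tmp/mkldemo/system /tmp/mkldemo/conda
touch /tmp/mkldemo/system/libmkl_gf_lp64.so  # stands in for /opt/intel/mkl/lib/intel64
touch /tmp/mkldemo/conda/libmkl_gf_lp64.so   # stands in for the conda env's lib dir

# With the conda dir prepended, it wins the search.
SEARCH="/tmp/mkldemo/conda:/tmp/mkldemo/system"
OLDIFS=$IFS; IFS=':'
for d in $SEARCH; do
  if [ -f "$d/libmkl_gf_lp64.so" ]; then
    echo "would load: $d/libmkl_gf_lp64.so"
    break
  fi
done
IFS=$OLDIFS
```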

You can run conda list | grep mkl to see the versions you have. If you still can’t figure it out, you might want to try setting up again from scratch with a proven working setup.

If you are setting up the environment for fast.ai with the latest pytorch on Azure, you can try the following proven steps:

  1. Create a new Deep Learning Virtual Machine resource, and choose Linux, new Resource Group (easier to manage later), HDD, NC6
  2. Connect to the VM: ssh username@ip
  3. Download the setup script: curl http://files.fast.ai/setup/paperspace -o setup.sh
  4. Remove the line sudo rm '/etc/apt/apt.conf.d/*.*' using vim setup.sh
  5. Make script executable: chmod +x setup.sh
  6. Execute the script: ./setup.sh
  7. Follow the script and reboot after successful installation; everything should be set up with the latest installs.
  8. Important: Downgrade mkl back to prior version: conda install mkl=2018.0.1
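Step 4 can also be done non-interactively with sed instead of vim - a sketch, demonstrated on a stand-in file (the real target is the setup.sh downloaded in step 3):

```shell
# Stand-in setup.sh containing the line step 4 says to remove.
printf "echo before\nsudo rm '/etc/apt/apt.conf.d/*.*'\necho after\n" > /tmp/setup_demo.sh
# Delete any line mentioning apt.conf.d, in place (GNU sed).
sed -i '/apt\.conf\.d/d' /tmp/setup_demo.sh
cat /tmp/setup_demo.sh
```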