Segfault in libtorch_cpu running fastai

Hello, all,

I’m trying to get a local install of fastai running…I was hoping that using the fastai docker images would spare me having to install and manage the fastai and pytorch libraries myself, but I’m running into a segfault in pytorch, which I’m not sure how to fix.

My setup:
CPU: Intel® Core™ i7 CPU 950 @ 3.07GHz
RAM: 12GB
Video card: GeForce RTX 2070 SUPER
OS: Ubuntu 20.04 (clean, just re-installed)
Nvidia driver: Ubuntu-provided nvidia-450
Followed instructions to install docker and nvidia-docker extensions.
Running torch.cuda.is_available() in the fastai docker container returns True.
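
For reference, this is roughly what I ran inside the container to check that the GPU was visible (reconstructed from memory, so treat the exact lines as approximate):

import torch

print(torch.__version__)              # whatever version ships in the fastai image
print(torch.cuda.is_available())      # True for me
print(torch.cuda.get_device_name(0))  # reports the RTX 2070 SUPER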

The error:
In Jupyter the kernel crashes & is restarted at the first “learn.fine_tune(1)” line.
In dmesg, there’s a line that says: traps: python[1910] trap invalid opcode ip:7fc9d0d63869 sp:7fff30e315a0 error:0 in libtorch_cpu.so[7fc9cfa41000+6754000]
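
In case the exact code matters: I’m just following the standard first notebook, so the cell that dies is essentially the stock cats-vs-dogs example below (not my literal notebook, but the same calls):

from fastai.vision.all import *

# Standard pets example from the first lesson
path = untar_data(URLs.PETS)/'images'

def is_cat(x): return x[0].isupper()

dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))

learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)  # the kernel dies here with the libtorch_cpu trap above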

If it helps, I had fastai v1 working on this hardware back in March, but got derailed by life and just picked it back up now. So, I know this setup can work, but something’s changed since March that’s not agreeing with my setup.

Has anyone seen this before, or have an idea of what I did wrong?


Are you seeing this issue only with the FastAI installation / docker image, or also if you install the PyTorch binaries directly?
If I’m not mistaken, these kinds of errors are raised when your CPU encounters unsupported instructions, e.g. AVX instructions on older CPUs.
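
A quick way to see which SIMD extensions your CPU actually reports is something like this (a Linux-only sketch that just parses /proc/cpuinfo, so adapt it to your system):

import re

cpuinfo = open('/proc/cpuinfo').read()
flags = set(re.search(r'^flags\s*:\s*(.*)$', cpuinfo, re.M).group(1).split())

# Nehalem-era chips should show the SSE4 flags but none of the AVX ones
for ext in ('sse4_1', 'sse4_2', 'avx', 'avx2', 'fma', 'avx512f'):
    print(ext, 'yes' if ext in flags else 'no')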

I tried installing pytorch via conda locally (outside the docker container) and I’m seeing the same thing, yeah. Is there a way to configure/compile pytorch to not use those newer instructions?

I admit the CPU itself is a bit old (1st gen core i7, Nehalem)…I repurposed my old game machine and swapped out the GPU for something recent and powerful. I had hoped the CPU wouldn’t matter that much if the GPU was recent.

Thanks for your help.

If you build from source, CMake might automatically detect the CPU capability and disable e.g. AVX if it’s not supported.
Could you try that and see if it works?
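
You could also confirm what a given binary was built to target via torch.__config__.show(); the exact wording differs between versions, but it should include a CPU capability / instruction set line:

import torch

# Prints the build configuration of the installed binary, including
# which instruction sets (e.g. AVX/AVX2) it was compiled to use
print(torch.__config__.show())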

I tried that, and got…mixed results.

The build finished properly when I ran it on master, but master calls itself version 1.8.0, and FastAI apparently isn’t compatible with that version…it’s expecting 1.6. I tried loading it anyway, and it raises an error about “FakeLoader” not having a “persistent_workers” attribute.
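
For what it’s worth, the version mismatch is easy to see by just printing the two version strings (nothing clever, just a sanity check):

import torch, fastai

print(torch.__version__)   # master calls itself 1.8.0
print(fastai.__version__)  # the release I had installed expects torch 1.6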

If I check out the v1.6.0 tag, the build fails with errors like:

../caffe2/quantization/server/conv_dnnlowp_op.cc:1211:55: error: ‘depthwise_3x3x3_per_channel_quantization_pad_1’ was not declared in this scope
         depthwise_3x3x3_per_channel_quantization_pad_1(

Is there something special I should be doing to build 1.6 instead of master?

Thanks again for the help.

Did you update all submodules after the 1.6 branch checkout?
Also, did you clean the build via python setup.py clean?
These types of errors are often raised when the build tries to reuse temporary files from previous builds.

Rather than mess with cleanup, I just deleted the folder & re-cloned it. I did:

git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
git pull
git checkout tags/v1.6.0
git pull
python3 setup.py build

and the end result was the same:

../caffe2/quantization/server/conv_dnnlowp_op.cc: In instantiation of ‘void caffe2::ConvDNNLowPOp<T, ReluFused>::ConvNHWCCore_(const T*, std::vector*) [with T = unsigned char; bool ReluFused = false]’:
../caffe2/quantization/server/conv_dnnlowp_op.cc:1752:16: required from here
../caffe2/quantization/server/conv_dnnlowp_op.cc:1211:55: error: ‘depthwise_3x3x3_per_channel_quantization_pad_1’ was not declared in this scope
         depthwise_3x3x3_per_channel_quantization_pad_1(

Any ideas what else I could try?

Thanks again.

After the branch checkout, run:

git submodule update --init --recursive

I’ve got a similar CPU (Westmere Xeon) and have the same problem. This appears to be PyTorch issue 43300 (https://github.com/pytorch/pytorch/issues/43300). I’ve tried the latest 1.7.0 nightly build and while I don’t get this problem, I get a different one that’s probably also AVX related.

FWIW, my new issue with the nightly is that NNPACK’s nnp_initialize call returns “unsupported hardware”.

The exact error is:
[W NNPACK.cpp:80] Could not initialize NNPACK! Reason: Unsupported hardware.

For a CPU as old as yours and mine, I think we’re going to have to build from source to ensure that no AVX support sneaks in.

I just tested the 1.7.0 release on my Westmere Xeon and the NNPACK error is still present. I’ve thrown in the towel and have some AVX-capable hardware to replace it, but I think anyone who wants to continue with pre-AVX hardware will need to build from source to avoid the SIGILL issues.

Thanks everyone for your help.

Fastai was patched this weekend to support torch 1.7, and that seems to have been enough to work with the master branch of pytorch. I manually built torch and torchvision from master, with ENABLE_NNPACK=0 set as an environment variable to avoid the “Unsupported hardware” error, and installed the master branch of fastai. That setup seems to be working for me.
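
For anyone else on pre-AVX hardware, the sanity check I ran after the source builds was just importing everything and repeating the check from my first post (the version strings are whatever the master branches reported at the time):

import torch, torchvision, fastai

print(torch.__version__, torchvision.__version__, fastai.__version__)
print(torch.cuda.is_available())  # True again, and fine_tune() now runs without the trap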