Error while running the Finetuning Torchvision Models tutorial

Hi All,
I am new to torch and I am working on ML and tried the below example in Google colab which worked perfectly and prediction was working as expected.

https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html

When I run the same code on Ubuntu 18.04 on a local machine without a GPU, it fails with `Illegal instruction (core dumped)`. The error output is shown below.

I have installed the latest CPU build of torch (1.9.0+cpu) and also tried torch 1.9.1, but the error persists.

Could you try to grab the stack trace via:

```
gdb --args python script.py args
...
run
...
bt
```

and post it here?

PS: you can post code snippets by wrapping them in three backticks ```, which would make debugging easier. :wink:

 **gdb --args python3 alexnet_working_code.ipynb  args**
GNU gdb (Ubuntu 8.1-0ubuntu3) 8.1.0.20180409-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<links related to gdb>
Find the GDB manual and other documentation resources online at:
<links related to gdb>
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python3...(no debugging symbols found)...done.
 **run**
Starting program: /usr/bin/python3 alexnet_working_code.ipynb args
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffdd772700 (LWP 4307)]
[Thread 0x7fffdd772700 (LWP 4307) exited]
PyTorch Version:  1.9.0+cpu
[New Thread 0x7fffdd772700 (LWP 4329)]
AlexNet(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU(inplace=True)
    (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU(inplace=True)
    (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(6, 6))
  (classifier): Sequential(
    (0): Dropout(p=0.5, inplace=False)
    (1): Linear(in_features=9216, out_features=4096, bias=True)
    (2): ReLU(inplace=True)
    (3): Dropout(p=0.5, inplace=False)
    (4): Linear(in_features=4096, out_features=4096, bias=True)
    (5): ReLU(inplace=True)
    (6): Linear(in_features=4096, out_features=15, bias=True)
  )
)
Initializing Datasets and Dataloaders...
/home/implant/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py:481: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))
cpu
cpu
Params to learn:
	 features.0.weight
	 features.0.bias
	 features.3.weight
	 features.3.bias
	 features.6.weight
	 features.6.bias
	 features.8.weight
	 features.8.bias
	 features.10.weight
	 features.10.bias
	 classifier.1.weight
	 classifier.1.bias
	 classifier.4.weight
	 classifier.4.bias
	 classifier.6.weight
	 classifier.6.bias
1
2
Epoch 0/9
----------
[New Thread 0x7fffb0da1700 (LWP 4346)]
[New Thread 0x7fffabfff700 (LWP 4347)]
[New Thread 0x7fffab7fe700 (LWP 4348)]
[New Thread 0x7fffaaffd700 (LWP 4349)]
[New Thread 0x7fffaa7fc700 (LWP 4350)]
[New Thread 0x7fffa9ffb700 (LWP 4351)]
[New Thread 0x7fffa97fa700 (LWP 4352)]
[New Thread 0x7fffa8ff9700 (LWP 4353)]
/home/implant/.local/lib/python3.6/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
[New Thread 0x7fff8ad9f700 (LWP 4372)]
    
Thread 3 "python3" received signal SIGILL, Illegal instruction.
[Switching to Thread 0x7fffdd772700 (LWP 4329)]
0x00007fffeea7369d in void dnnl::impl::cpu::(anonymous namespace)::block_ker<float, true, false>(long, long, long, float const*, long, float const*, long, float*, long, float, float, float*, bool) () from /home/implant/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so

**(gdb) bt**
#0  0x00007fffeea7369d in void dnnl::impl::cpu::(anonymous namespace)::block_ker<float, true, false>(long, long, long, float const*, long, float const*, long, float*, long, float, float, float*, bool) ()
   from /home/implant/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#1  0x00007fffeea73881 in void dnnl::impl::cpu::(anonymous namespace)::gemm_ithr<float, true, false>(long, long, long, float, float const*, long, float const*, long, float, float*, long, bool, float*) ()
   from /home/implant/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#2  0x00007fffeea778fe in dnnl_status_t dnnl::impl::cpu::ref_gemm<float>(char const*, char const*, long const*, long const*, long const*, float const*, float const*, long const*, float const*, long const*, float const*, float*, long const*, float const*) ()
   from /home/implant/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#3  0x00007fffee9aca87 in void dnnl::impl::parallel<dnnl::impl::cpu::gemm_convolution_bwd_weights_t::execute_backward_weights_ncsp(dnnl::impl::exec_ctx_t const&) const::{lambda(int, int)#1}>(int, dnnl::impl::cpu::gemm_convolution_bwd_weights_t::execute_backward_weights_ncsp(dnnl::impl::exec_ctx_t const&) const::{lambda(int, int)#1}) [clone ._omp_fn.12] ()
   from /home/implant/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#4  0x00007ffff47a7405 in ?? () from /home/implant/.local/lib/python3.6/site-packages/torch/lib/libgomp-a34b3233.so.1
#5  0x00007ffff77cc6db in start_thread (arg=0x7fffdd772700) at pthread_create.c:463
#6  0x00007ffff7b05a3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

This is what I got after executing those commands. Can you please look into it?

Thanks for the stack trace. It seems that oneDNN is causing the failure. Could you create an issue on GitHub so that the code owners can take a look at it, please?
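As a side note (this is an assumption, not something confirmed in the trace itself): a SIGILL inside `libtorch_cpu.so` often means the prebuilt binary executed a SIMD instruction (e.g. AVX or AVX2) that the local CPU does not support. A quick sketch for checking which instruction sets the CPU advertises:

```shell
# List the SIMD-related flags this CPU advertises (Linux only).
# If avx/avx2 are missing, prebuilt oneDNN kernels that assume
# them can crash with SIGILL on this machine.
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' \
  | grep -E '^(sse4_1|sse4_2|avx|avx2|avx512f)$'
```

If `avx` does not appear in the output, building PyTorch from source for the target CPU is a common workaround; either way, including this output in the GitHub issue should help the code owners.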

Sure, I will create an issue on GitHub.