Issue: Kernel dies when fitting PyTorch model on Linux Mint


#2

Could you export the notebook as a script and run it in your terminal?
This will most likely return an error message instead of just a kernel restart.


(Artemiy) #3

Like this?
OK, I have this error, but what is it?
There are no errors on Windows.


#4

I assume you are running and editing your notebook in a browser. You can export it via:

File -> Download as -> Python (.py)

Alternatively you might use

jupyter nbconvert --to script your_notebook.ipynb
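
Then run the exported script directly from a terminal (assuming the exported file is named your_notebook.py), so the crash message is printed instead of just killing the kernel:

python your_notebook.py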

(Artemiy) #5

Translated from Russian: Invalid instruction (the memory stack is dumped to disk)


#6

Could you run your script with pdb to get the stack trace?
The error message would probably translate to illegal instruction (core dumped).
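
For reference, one way to launch the exported script under pdb (my_script.py is just a placeholder name):

python -m pdb my_script.py
(Pdb) continue

Note that a crash caused by an illegal CPU instruction can kill the interpreter before pdb gets a chance to show a Python-level trace.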


(Artemiy) #7

I’m sorry for so many screenshots, but here is everything.

[screenshots]

(colesbury) #8

This sounds similar to “Unable to sum the result of an equality test”.

Do you know what model CPU you have on the Linux Mint machine?
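
If it is not handy, one quick way to look up the CPU model on Linux (either command should work on a stock install):

grep -m1 'model name' /proc/cpuinfo
lscpu | grep 'Model name'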


(colesbury) #9

Also, it looks like you have both the nightly PyTorch build (0.5.0a0) and PyTorch-CPU (0.4.1) installed. I’m not sure which version you are running. Can you uninstall the older pytorch-cpu build?

conda uninstall pytorch-cpu
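
To confirm which build is actually being imported afterwards, you can print the version from the command line (torch.__version__ is a standard attribute):

python -c "import torch; print(torch.__version__)"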

(Artemiy) #10

CPU - AMD A6-6310

After I uninstalled pytorch-cpu, Python can’t find torch.nn.


(Artemiy) #11

I read the topic about “Unable to sum the result of an equality test”.
So PyTorch cannot run on my CPU?


(colesbury) #13

Your CPU should be OK. It looks like there is a bug in PyTorch, but I am not sure which PyTorch version you are using.

Please try the following. First fully uninstall PyTorch:

conda uninstall -y pytorch-cpu
conda uninstall -y pytorch

Next try the nightly CPU build from yesterday:

pip install https://download.pytorch.org/whl/nightly/cpu/torch_nightly-2018.8.14.dev1-cp36-cp36m-linux_x86_64.whl

Please let me know if this works.


(Artemiy) #15

Unfortunately, it didn’t help. Again, the same error.


(colesbury) #16

Can you try running your script under gdb and report the backtrace?

$ gdb --args python my_script.py
...
Reading symbols from python...done.
(gdb) run
...
(gdb) backtrace
...

(Artemiy) #17

[screenshot of the gdb backtrace]

(colesbury) #18

Thanks, this is very helpful. Can you also run disas and report the output?

$ gdb --args python my_script.py
...
Reading symbols from python...done.
(gdb) run
...
(gdb) disas
...

(Artemiy) #19

[screenshot of the disas output]

(colesbury) #20

OK, it looks like the FMA4 vfmaddps instruction is the problem. I’m a bit confused because your CPU should support that instruction.

  1. Can you report the CPU flags (a quick check for the fma4 flag is sketched after this list): grep flags < /proc/cpuinfo
  2. Can you report the kernel version: uname -a
  3. Are you running in a virtual machine?
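
For item 1, a quick way to check just the relevant flag (an empty result means the kernel does not report fma4):

grep -o fma4 /proc/cpuinfo | sort -u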

(Artemiy) #21

About the VM - no, I’m not running in one.


(Artemiy) #22

So, any results for this problem?


(colesbury) #23

No. I’m still not sure if your CPU is supposed to support the FMA4 instructions. I see conflicting information on AMD’s website. I’m asking AMD engineers about this. In the meantime, can you download, compile, and run this program which prints out information about your CPU:

curl https://gist.githubusercontent.com/colesbury/68ce5ededb6b48998af8a41ca326246b/raw/af8ff7d685042f62e9ac4d71047d423ab5ed2569/cpuid.c > cpuid.c
gcc cpuid.c
./a.out