Issue: Kernel dies when fitting PyTorch model on Linux Mint

ponafly · August 12, 2018, 11:45am

Issue description

Hello. I’m using Jupyter Notebook on Linux Mint x64 and I have a serious problem with PyTorch: when I run the code below, the kernel dies. The same code runs perfectly on Windows 8.1 x64. I have tried several installation methods (pip, conda, and building from the source code on GitHub), but nothing helps. Please explain why this happens every time and how to fix it.

Code example

import torch
from torch import nn
import torch.nn.functional as F
from notmnist import load_notmnist  # local helper module, not part of PyTorch

# Load the letters A and B and flatten each image to a 784-element vector
X_train, y_train, X_test, y_test = load_notmnist(letters='AB')
X_train, X_test = X_train.reshape([-1, 784]), X_test.reshape([-1, 784])

# Logistic regression: one linear layer followed by a sigmoid
model = nn.Sequential()
model.add_module('l1', nn.Linear(784, 1))
model.add_module('l2', nn.Sigmoid())

x = torch.tensor(X_train[:3], dtype=torch.float32)
y = torch.tensor(y_train[:3], dtype=torch.float32)
y_predicted = model(x)[:, 0]  # the kernel dies during this forward pass

System Info


PyTorch version: 0.5.0a0+b640264
Is debug build: No
CUDA used to build PyTorch: None

OS: Linux Mint 19 Tara
GCC version: (Ubuntu 7.3.0-16ubuntu3) 7.3.0
CMake version: version 3.10.2

Python version: 3.6
Is CUDA available: No
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA

Versions of relevant libraries:
[pip] Could not collect
[conda] pytorch-cpu 0.4.1 py36_cpu_1 pytorch
[conda] torch 0.5.0a0+b640264
[conda] torchvision-cpu 0.2.1 py36_1 pytorch

If you need any logs, please tell me where I can find them.

ptrblck · August 13, 2018, 10:58am

Could you export the notebook as a script and run it in your terminal?
This will most likely return an error message instead of just a kernel restart.

ponafly · August 13, 2018, 3:09pm

Like this?
OK, I get this error, but what does it mean?
There are no errors on Windows.

ptrblck · August 13, 2018, 3:23pm

I assume you are running and editing your notebook in a browser. You can export it via:

File -> Download as -> Python (.py)

Alternatively you might use

jupyter nbconvert --to script your_notebook.ipynb

ponafly · August 13, 2018, 3:53pm

Translated from Russian: Invalid instruction (the memory stack is flushed to disk)

ptrblck · August 13, 2018, 4:00pm

Could you run your script with pdb to get the stack trace?
The error message would probably translate to illegal instruction (core dumped).
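Since pdb itself may be killed by a fatal signal before it can print anything, another option is Python's built-in faulthandler module. The following is only a minimal sketch, assuming Python 3.3 or newer: placed at the very top of the exported script, it should dump the Python-level traceback to stderr when the process receives a signal such as SIGILL. It shows which Python line triggered the crash, not the native stack.

```python
import faulthandler

# Ask the interpreter to dump the Python traceback to stderr when the
# process receives a fatal signal (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL).
faulthandler.enable()

# The rest of the exported notebook script would follow here, for example:
# import torch
# ...
# y_predicted = model(x)[:, 0]
```

The same behaviour is available without editing the file by running `python -X faulthandler your_script.py`, where `your_script.py` is a placeholder for the exported notebook.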

ponafly · August 13, 2018, 4:19pm

I’m sorry for so many screenshots, but here is everything.

colesbury · August 13, 2018, 7:56pm

This sounds similar to the thread “Unable to sum the result of an equality test”.

Do you know what model CPU you have on the Linux Mint machine?

colesbury · August 13, 2018, 9:31pm

Also, it looks like you have both the nightly PyTorch build (0.5.0a0) and PyTorch-CPU (0.4.1) installed. I’m not sure which version you are running. Can you uninstall the older pytorch-cpu build?

conda uninstall pytorch-cpu
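To see which build the interpreter actually picks up, something like the snippet below may help. It is just a sketch: `describe_install` is a hypothetical helper name, and it relies only on the standard importlib module plus the `__version__` and `__file__` attributes that torch exposes.

```python
import importlib
import importlib.util


def describe_install(module_name):
    """Return the version and file path of an importable module, or None."""
    if importlib.util.find_spec(module_name) is None:
        return None  # the module cannot be imported from this interpreter
    module = importlib.import_module(module_name)
    return {
        "version": getattr(module, "__version__", "unknown"),
        "path": getattr(module, "__file__", "unknown"),
    }


if __name__ == "__main__":
    # The path shows which site-packages directory (and so which install) wins.
    print(describe_install("torch"))
```

Running it inside the same environment that the Jupyter kernel uses matters, because a notebook kernel can point at a different Python than the terminal does.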

ponafly · August 14, 2018, 3:01pm

CPU: AMD A6-6310

After I uninstalled pytorch-cpu, Python can’t find torch.nn.

ponafly · August 15, 2018, 3:53pm

I read the topic “Unable to sum the result of an equality test”.
So PyTorch cannot run on my CPU?

colesbury · August 15, 2018, 6:36pm

Your CPU should be OK. It looks like there is a bug in PyTorch, but I am not sure which PyTorch version you are using.

Please try the following. First fully uninstall PyTorch:

conda uninstall -y pytorch-cpu
conda uninstall -y pytorch

Next try the nightly CPU build from yesterday:

pip install https://download.pytorch.org/whl/nightly/cpu/torch_nightly-2018.8.14.dev1-cp36-cp36m-linux_x86_64.whl

Please let me know if this works.

ponafly · August 16, 2018, 8:20am

Unfortunately, it didn’t help. The same error again.

colesbury · August 16, 2018, 3:06pm

Can you try running your script under gdb and report the backtrace?

$ gdb --args python my_script.py
...
Reading symbols from python...done.
(gdb) run
...
(gdb) backtrace
...

ponafly · August 16, 2018, 4:01pm

colesbury · August 16, 2018, 4:32pm

Thanks, this is very helpful. Can you also run disas and report the output?

$ gdb --args python my_script.py
...
Reading symbols from python...done.
(gdb) run
...
(gdb) disas
...

ponafly · August 16, 2018, 5:00pm

colesbury · August 16, 2018, 6:22pm

OK, it looks like the FMA4 vfmaddps instruction is the problem. I’m a bit confused because your CPU should support that instruction.

Can you report the cpu flags: grep flags < /proc/cpuinfo
Can you report the kernel version: uname -a
Are you running in a virtual machine?
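If the full flags line is too long to read comfortably, a small script can pull out only the relevant entries. This is a rough sketch: the helper names are made up for illustration, and the list of extensions to check (avx, avx2, fma, fma4) is my assumption about what matters here, using the flag names Linux prints in /proc/cpuinfo.

```python
from pathlib import Path


def parse_cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()  # no 'flags' line found (e.g. non-x86 systems)


def check_extensions(flags, wanted=("avx", "avx2", "fma", "fma4")):
    """Map each wanted extension name to whether the CPU reports it."""
    return {name: name in flags for name in wanted}


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print(check_extensions(parse_cpu_flags(cpuinfo.read_text())))
```

If `fma4` comes back as False, the CPU is not advertising the instruction that the disassembly points to.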

ponafly · August 17, 2018, 2:52am

About the VM: no, I’m not running in one.

ponafly · August 20, 2018, 3:12pm

So, is there any progress on this problem?