PyTorch on an Ubuntu 18.04.3 computer - CNN training seems to crash the kernel

I have been using PyTorch on my Ubuntu computer (latest Anaconda3, Python 3.6.9, PyTorch 1.2.0, Linux kernel 5.2.14, 4 GB of RAM) for self-training/personal interest. No GPU; I am running on CPU only.

Most of my programs work fine: linear regressions, logistic regressions, ANN models, and RNN models. However, whenever I run a CNN model it crashes with the message “The kernel appears to have died. It will restart automatically”. Everything else seems to work. I have googled this and it seems to be a common issue, but none of the suggested solutions work for me.

Any comments or suggestions, folks?

Thank you

Edit: It seems I had read too much LWN when first looking at this. The “kernel” in the headline is the Jupyter kernel rather than the Linux kernel I had in mind.

It would seem close to impossible for PyTorch to crash a fully functional system (except by stalling it via swapping), as opposed to just the PyTorch process being terminated.
So I would recommend trying other computationally intensive workloads to see if something is wrong with your computer.
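For example (just a rough sketch; the matrix size and iteration count are arbitrary choices), something like this keeps the CPU busy with heavy numerical work for a while and should show whether the machine itself struggles under sustained load:

import numpy as np

# repeatedly multiply two moderately large matrices to keep the CPU busy;
# an unstable machine will often freeze, reboot, or kill the process here
a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)
for i in range(20):
    c = a @ b
    print('iteration %d, checksum %.3f' % (i, c.sum()))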

Best regards

Thomas

Hi Tom, thanks for your response. It just crashes when I am running CNN training models with the message “The kernel appears to have died. It will restart automatically”. I use TensorFlow for other computationally intensive workloads, and they seem to work fine. I re-installed PyTorch (version 1.3.0) last night and the same thing keeps happening, except it seems that the script is still running on my hard drive. Anyway, I’ll keep experimenting; maybe it’s just because I am on an old computer, a Pentium-based one, which runs everything else fine.

Anyway, thanks, Tom.

Clive

Could you try to run your script in a terminal (export your notebook as a script, if necessary) and post the stack trace here, please?
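In case it helps (the notebook name below is just a placeholder), a notebook can be exported to a plain script with nbconvert and then run directly:

jupyter nbconvert --to script my_notebook.ipynb
python my_notebook.py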

Will do, Peter. Just trying to back up some of the stuff. By the way, I just ran the same script on my Win 10 laptop and it worked fine.

https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html

import torch
import torchvision
import torchvision.transforms as transforms
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                         shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
# Output:
# Files already downloaded and verified
# Files already downloaded and verified
import matplotlib.pyplot as plt
import numpy as np

# functions to show an image


def imshow(img):
    img = img / 2 + 0.5     # unnormalize
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.show()


# get some random training images
dataiter = iter(trainloader)
images, labels = dataiter.next()

# show images
imshow(torchvision.utils.make_grid(images))
# print labels
print(' '.join('%5s' % classes[labels[j]] for j in range(4)))
# Output:
# <Figure size 640x480 with 1 Axes>
#  frog plane  deer   car

2. Define a Convolutional Neural Network

import torch.nn as nn
import torch.nn.functional as F
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


net = Net()

3. Define a Loss function and optimizer

import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

4. Train the network

for epoch in range(2):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

print('Finished Training')

The kernel appears to have died. It will restart automatically.


Here it is in Markdown format, a bit of a mess, Peter. But everything worked until the training part, where the kernel died. I might add that the same script ran perfectly on Google Colaboratory.

Thanks for the code!
Could you now rerun this code from a terminal via

python script.py

and check if you get a valid stack trace? Jupyter kernels sometimes just get restarted without a proper error message (at least that’s my experience).
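As an aside (this is just a standard-library option, not something specific to PyTorch), enabling faulthandler when launching the script makes the interpreter print a Python traceback even if it dies from a fatal signal such as a segfault:

python -X faulthandler script.py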

Peter, I’ll give it a shot; not sure if I understand how to run your suggested script with my code?

script.py is just a dummy name for your script.
Just run the script in a terminal instead of a Jupyter notebook. :wink:

Gotcha, Peter, will do.

Hi Peter,
I took the original ipynb file and converted it into a py file, ran it in Spyder, and got the same message about the kernel restarting.

I then tried, as suggested, python fifar10.py in a Linux terminal and it just froze. It may just be because I am running it on an older (2009) desktop; it might be getting ready to die. I’m going to leave this for a while.

Thanks for your help

Clive

Peter, I managed to do a backtrace using GDB; it makes no sense to me, but here it is:

gdb --args python CIFAR_10.py

(gdb) run
Starting program: /home/clived/anaconda3/envs/tflow/bin/python CIFAR_10.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Files already downloaded and verified
[New Thread 0x7fffd6fa2780 (LWP 22610)]
Files already downloaded and verified
[New Thread 0x7fffcbf5d700 (LWP 22614)]
[New Thread 0x7fffcb75c700 (LWP 22615)]
[New Thread 0x7fffca315700 (LWP 22620)]
[New Thread 0x7fffb9b79700 (LWP 22621)]
^C
Thread 1 "python" received signal SIGINT, Interrupt.

(gdb) backtrace
#0 0x00007ffff78d9bf9 in __GI___poll (fds=0x7fffc00072d0, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1 0x00007fffd4098cb3 in g_main_context_iterate.isra () from /home/clived/anaconda3/envs/tflow/lib/python3.6/site-packages/PyQt5/../../.././libglib-2.0.so.0
#2 0x00007fffd4098dce in g_main_context_iteration () from /home/clived/anaconda3/envs/tflow/lib/python3.6/site-packages/PyQt5/../../.././libglib-2.0.so.0
#3 0x00007fffd4b608df in QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>) ()
from /home/clived/anaconda3/envs/tflow/lib/python3.6/site-packages/PyQt5/../../../libQt5Core.so.5
#4 0x00007fffd4b30eb7 in QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) ()
from /home/clived/anaconda3/envs/tflow/lib/python3.6/site-packages/PyQt5/../../../libQt5Core.so.5
#5 0x00007fffd4b34aab in QCoreApplication::exec() () from /home/clived/anaconda3/envs/tflow/lib/python3.6/site-packages/PyQt5/../../../libQt5Core.so.5
#6 0x00007fffcc875970 in meth_QApplication_exec () from /home/clived/anaconda3/envs/tflow/lib/python3.6/site-packages/PyQt5/QtWidgets.so
#7 0x0000555555665b91 in _PyCFunction_FastCallDict () at /tmp/build/80754af9/python_1564510748219/work/Objects/methodobject.c:234
#7 0x0000555555665b91 in _PyCFunction_FastCallDict () at /tmp/build/80754af9/python_1564510748219/work/Objects/methodobject.c:234
#8 0x00005555556edabc in call_function () at /tmp/build/80754af9/python_1564510748219/work/Python/ceval.c:4851
#9 0x000055555571075a in _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1564510748219/work/Python/ceval.c:3335
#10 0x00005555556e7c5b in _PyFunction_FastCall (globals=<optimized out>, nargs=0, args=<optimized out>, co=<optimized out>)
at /tmp/build/80754af9/python_1564510748219/work/Python/ceval.c:4933
#11 fast_function () at /tmp/build/80754af9/python_1564510748219/work/Python/ceval.c:4968
#12 0x00005555556edb95 in call_function () at /tmp/build/80754af9/python_1564510748219/work/Python/ceval.c:4872
#13 0x000055555571075a in _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1564510748219/work/Python/ceval.c:3335
#14 0x00005555556e89b9 in _PyEval_EvalCodeWithName (qualname=0x0, name=<optimized out>, closure=0x0, kwdefs=0x0, defcount=1, defs=0x7fffd5a47728, kwstep=2,
kwcount=<optimized out>, kwargs=0x7ffff7f9c068, kwnames=0x7ffff7f9c060, argcount=<optimized out>, args=0x5555572f2410, locals=0x0, globals=<optimized out>,
_co=0x7fffd5a489c0) at /tmp/build/80754af9/python_1564510748219/work/Python/ceval.c:4166
#15 PyEval_EvalCodeEx () at /tmp/build/80754af9/python_1564510748219/work/Python/ceval.c:4187
#16 0x00005555556e98e6 in function_call () at /tmp/build/80754af9/python_1564510748219/work/Objects/funcobject.c:604
#17 0x0000555555665a5e in PyObject_Call () at /tmp/build/80754af9/python_1564510748219/work/Objects/abstract.c:2261
---Type <return> to continue, or q <return> to quit---
#18 0x0000555555711e37 in do_call_core (kwdict=0x7fffcbf91438, callargs=0x7fffc8f8b0f0, func=0x7fffd5a21a60)
at /tmp/build/80754af9/python_1564510748219/work/Python/ceval.c:5120
#19 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1564510748219/work/Python/ceval.c:3404
#20 0x00005555556e729e in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1564510748219/work/Python/ceval.c:4166
#21 0x00005555556e8598 in _PyFunction_FastCallDict () at /tmp/build/80754af9/python_1564510748219/work/Python/ceval.c:5084
#22 0x000055555566601f in _PyObject_FastCallDict () at /tmp/build/80754af9/python_1564510748219/work/Objects/abstract.c:2310
#23 0x000055555566aaa3 in _PyObject_Call_Prepend () at /tmp/build/80754af9/python_1564510748219/work/Objects/abstract.c:2373
#24 0x0000555555665a5e in PyObject_Call () at /tmp/build/80754af9/python_1564510748219/work/Objects/abstract.c:2261
#25 0x0000555555711e37 in do_call_core (kwdict=0x7fffcbf72480, callargs=0x7ffff7f9c048, func=0x7fffd5504e48)
at /tmp/build/80754af9/python_1564510748219/work/Python/ceval.c:5120
#26 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1564510748219/work/Python/ceval.c:3404
#27 0x00005555556e6e66 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1564510748219/work/Python/ceval.c:4166
#28 0x00005555556e7ed6 in fast_function () at /tmp/build/80754af9/python_1564510748219/work/Python/ceval.c:4992
#29 0x00005555556edb95 in call_function () at /tmp/build/80754af9/python_1564510748219/work/Python/ceval.c:4872
#30 0x000055555571075a in _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1564510748219/work/Python/ceval.c:3335
#31 0x00005555556e7c5b in _PyFunction_FastCall (globals=<optimized out>, nargs=1, args=<optimized out>, co=<optimized out>)
at /tmp/build/80754af9/python_1564510748219/work/Python/ceval.c:4933
#32 fast_function () at /tmp/build/80754af9/python_1564510748219/work/Python/ceval.c:4968
#33 0x00005555556edb95 in call_function () at /tmp/build/80754af9/python_1564510748219/work/Python/ceval.c:4872
#34 0x000055555571075a in _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1564510748219/work/Python/ceval.c:3335
#35 0x00005555556e89b9 in _PyEval_EvalCodeWithName (qualname=0x0, name=<optimized out>, closure=0x0, kwdefs=0x0, defcount=0, defs=0x0, kwstep=2, kwcount=<optimized out>,
kwargs=0x0, kwnames=0x0, argcount=0, args=0x0, locals=0x7ffff7f531b0, globals=0x7ffff7f531b0, _co=0x7ffff63f5a50)
at /tmp/build/80754af9/python_1564510748219/work/Python/ceval.c:4166
---Type <return> to continue, or q <return> to quit---
#36 PyEval_EvalCodeEx () at /tmp/build/80754af9/python_1564510748219/work/Python/ceval.c:4187
#37 0x00005555556e975c in PyEval_EvalCode (co=co@entry=0x7ffff63f5a50, globals=globals@entry=0x7ffff7f531b0, locals=locals@entry=0x7ffff7f531b0)
at /tmp/build/80754af9/python_1564510748219/work/Python/ceval.c:731
#38 0x0000555555769744 in run_mod () at /tmp/build/80754af9/python_1564510748219/work/Python/pythonrun.c:1025
#39 0x0000555555769b41 in PyRun_FileExFlags () at /tmp/build/80754af9/python_1564510748219/work/Python/pythonrun.c:978
#40 0x0000555555769d43 in PyRun_SimpleFileExFlags () at /tmp/build/80754af9/python_1564510748219/work/Python/pythonrun.c:419
#41 0x0000555555769e4d in PyRun_AnyFileExFlags () at /tmp/build/80754af9/python_1564510748219/work/Python/pythonrun.c:81
#42 0x000055555576d833 in run_file (p_cf=0x7fffffffdc0c, filename=0x5555558a86c0 L"CIFAR_10.py", fp=0x55555593f120)
at /tmp/build/80754af9/python_1564510748219/work/Modules/main.c:340
#43 Py_Main () at /tmp/build/80754af9/python_1564510748219/work/Modules/main.c:811
#44 0x000055555563788e in main () at /tmp/build/80754af9/python_1564510748219/work/Programs/python.c:69
#45 0x00007ffff77e6b97 in __libc_start_main (main=0x5555556377a0 <main>, argc=2, argv=0x7fffffffde18, init=<optimized out>, fini=<optimized out>,
rtld_fini=<optimized out>, stack_end=0x7fffffffde08) at ../csu/libc-start.c:310
#46 0x0000555555717160 in _start () at ../sysdeps/x86_64/elf/start.S:103
(gdb)

Thanks for the information. Since the script execution just hangs and you had to kill it via CTRL+C, you won’t get a proper backtrace.

Anyway, since the code runs fine on another machine, we should focus on your current setup.
If you are using conda, could you create a clean new virtual environment via

conda create -n pytorch_stable python=3.7 anaconda
...
conda activate pytorch_stable

and install the latest stable release via the command from the website?
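(For reference, at the time of this thread the CPU-only conda command shown on pytorch.org looked roughly like the line below; the command on the website is the authoritative one.)

conda install pytorch torchvision cpuonly -c pytorch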

Also, some specs about the machine could be helpful,
e.g. which CPU it is using, etc.
I’ve seen some issues with old CPUs which didn’t support certain instruction sets (I think it was related to AVX instructions, but I’m not sure).
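(Just as an illustration of how to check this on Linux: grepping the CPU flags reported by the kernel shows which SIMD instruction sets the processor supports.)

grep -o -w -E 'sse4_1|sse4_2|avx|avx2' /proc/cpuinfo | sort -u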

Hi Peter. That’s a good idea; my computer is a Dell OptiPlex 760 with the following specs:
OS: Ubuntu 18.04 (bionic) kernel = 5.2.14-050214-generic
CPU: Pentium® Dual-Core CPU E5200 @ 2.50GHz Frequency = 2227.079 MHz L2 cache = 2048 KB

The Anaconda3 install is using Python 3.7, but most of my dev work is in an env built for Python 3.6.
I’ll go ahead with your suggestion and see what happens; I’ll let you know how things go.

Thanks
Clive

Hi Peter, with your help I figured this out. I remembered that way back I started having this problem with training a CNN when PyTorch 1.2.0 was introduced. Anyway, I came across a command which installed the appropriate versions of torch and torchvision for my older Dell OptiPlex desktop, which turned out to be PyTorch 1.1.0. The command is as follows:
(tflow) clived@clived-OptiPlex-760:~$ conda install pytorch-cpu torchvision-cpu -c pytorch
I ran this in my tflow env (Python 3.6) and in the pytorch_stable environment (Python 3.7) you suggested, and in both cases training a CNN model worked fine.
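(For anyone hitting the same issue: the command above happened to resolve to the CPU-only 1.1.0 builds. Pinning the version explicitly should also work, assuming those builds are still available on the pytorch channel.)

conda install pytorch-cpu=1.1.0 torchvision-cpu -c pytorch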

Glad that this has been resolved, as it had been on my mind for the last two weeks.

Thank you

Clive
