I am currently performing multi-label classification. When working with a larger dataset, I get the message:
Kernel have died it will restart automatically
Model Configuration:
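    import torch
    import torch.nn as nn
    import torchvision.models as tmodels

    class model_init(nn.Module):
        def __init__(self):
            super(model_init, self).__init__()
            # Three pretrained ResNet-18 copies: one for the shared trunk, one per branch.
            resnet18 = tmodels.resnet18(pretrained=True)
            resnet18_1 = tmodels.resnet18(pretrained=True)
            resnet18_2 = tmodels.resnet18(pretrained=True)
            # Shared trunk: conv1 through layer2.
            self.m1 = nn.Sequential(*(list(resnet18.children())[:-4]))
            # Two branches with separate weights: layer3, layer4, avgpool.
            self.m2 = nn.Sequential(*(list(resnet18_1.children())[-4:-1]))
            self.m3 = nn.Sequential(*(list(resnet18_2.children())[-4:-1]))
            # One linear head per label set (500 and 900 classes).
            self.m2_classifier = nn.Linear(512, 500, bias=True)
            self.m3_classifier = nn.Linear(512, 900, bias=True)

        def forward(self, img):
            x = self.m1(img)
            # flatten(start_dim=1) turns (N, 512, 1, 1) into (N, 512) and keeps the
            # batch dimension even when N == 1, which torch.squeeze would drop.
            m2_feat = torch.flatten(self.m2(x), 1)
            m3_feat = torch.flatten(self.m3(x), 1)
            # These are raw scores (logits), one row per image.
            m2_prob = self.m2_classifier(m2_feat)
            m3_prob = self.m3_classifier(m3_feat)
            return m2_prob, m3_prob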
The code runs fine on smaller datasets with 125 classes for label 1 and 250 classes for label 2. The larger dataset has 500 classes for label 1 and 900 classes for label 2. I have monitored the GPU, RAM, and swap memory usage of my RTX A5000 machine (24 GB) with 128 GB RAM. GPU usage remains below 70%, and RAM and swap usage are also nominal. The code crashes on the very first batch forward pass. The number of workers is set to 10 on a 20-core processor. Reducing the batch size down to 2 does not solve the problem.
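Since the only stated difference between the working and failing runs is the number of classes, one thing worth ruling out is a label index that falls outside the range each classifier head expects (0 to 499 for the 500-way head, 0 to 899 for the 900-way head). The following is a minimal sketch of such a check. It assumes the targets are integer class indices; the helper name `find_out_of_range_labels` and the sample label lists are hypothetical and only stand in for the real annotations. It may well turn out not to be the cause here.

```python
def find_out_of_range_labels(labels, num_classes):
    """Return (position, value) pairs for labels outside [0, num_classes)."""
    return [
        (i, int(y))
        for i, y in enumerate(labels)
        if not 0 <= int(y) < num_classes
    ]


# Hypothetical label lists standing in for the real dataset annotations.
labels_1 = [0, 17, 499]   # all valid for a 500-way head
labels_2 = [3, 900, 42]   # 900 is invalid for a 900-way head (valid: 0..899)

bad_1 = find_out_of_range_labels(labels_1, num_classes=500)
bad_2 = find_out_of_range_labels(labels_2, num_classes=900)

print(bad_1)  # []
print(bad_2)  # [(1, 900)]
```

The helper accepts any iterable of values convertible with `int`, so a 1-D label tensor could be passed as `labels.tolist()`.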
Running the script under gdb with

    gdb --args python3 script.py
    (gdb) run

has the following output:
I have tried to localize the problem, but the crash does not occur at a fixed location. The kernel sometimes crashes on the backward pass (the loss.backward() step) and sometimes on the forward pass through the model.
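A possible reason the crash location moves around is that CUDA kernels are launched asynchronously, so a failure can surface at a later Python line than the one that caused it. Below is a minimal sketch of a diagnostic setup using the standard-library `faulthandler` module together with the `CUDA_LAUNCH_BLOCKING` environment variable. The helper name `enable_crash_diagnostics` and the log file name `crash_traceback.log` are hypothetical. The helper should be called at the very top of the script, before any CUDA work, because the environment variable only takes effect if it is set before CUDA is initialized.

```python
import faulthandler
import os


def enable_crash_diagnostics(log_path="crash_traceback.log"):
    """Hypothetical helper: call once, before the first CUDA operation."""
    # Make CUDA kernel launches synchronous so that an error is reported
    # at the call that triggered it rather than at some later point.
    os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")

    # On a fatal signal (e.g. SIGSEGV or SIGABRT), write the Python
    # traceback of every thread to the log file. The file object must
    # stay open for as long as faulthandler is enabled.
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Usage (at the top of script.py, before importing or using torch.cuda):
# crash_log = enable_crash_diagnostics()
```

Writing to a file rather than to `sys.stderr` is a deliberate choice here: inside a Jupyter kernel, `sys.stderr` may not expose a real file descriptor, which `faulthandler.enable` requires.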
Complete System Configuration:
Package Version
------------------------ -------------
absl-py 1.4.0
anyio 3.6.2
apturl 0.5.2
argon2-cffi 21.3.0
argon2-cffi-bindings 21.2.0
arrow 1.2.3
asttokens 2.2.1
attrs 22.2.0
backcall 0.2.0
bcrypt 3.2.0
beautifulsoup4 4.11.1
bellmanford 0.2.1
bleach 5.0.1
blinker 1.4
Brlapi 0.8.3
cachetools 5.3.0
certifi 2020.6.20
cffi 1.15.1
chardet 4.0.0
click 8.0.3
colorama 0.4.4
comm 0.1.2
command-not-found 0.3
contourpy 1.0.7
cryptography 3.4.8
cupshelpers 1.0
cycler 0.11.0
Cython 0.29.33
dbus-python 1.2.18
debugpy 1.6.5
decorator 4.4.2
defer 1.0.6
defusedxml 0.7.1
distro 1.7.0
distro-info 1.1build1
duplicity 0.8.21
einops 0.6.0
entrypoints 0.4
executing 1.2.0
fasteners 0.14.1
fastjsonschema 2.16.2
filelock 3.9.0
fonttools 4.38.0
fqdn 1.5.1
future 0.18.2
google-auth 2.16.2
google-auth-oauthlib 0.4.6
grpcio 1.51.3
httplib2 0.20.2
huggingface-hub 0.12.0
idna 3.3
importlib-metadata 4.6.4
ipykernel 6.20.2
ipython 8.8.0
ipython-genutils 0.2.0
isoduration 20.11.0
jedi 0.18.2
jeepney 0.7.1
Jinja2 3.1.2
joblib 1.2.0
jsonpointer 2.3
jsonschema 4.17.3
jupyter_client 7.4.9
jupyter_core 5.1.3
jupyter-events 0.6.3
jupyter_server 2.1.0
jupyter_server_terminals 0.4.4
jupyterlab-pygments 0.2.2
keyring 23.5.0
kiwisolver 1.4.4
language-selector 0.1
launchpadlib 1.10.16
lazr.restfulclient 0.14.4
lazr.uri 1.0.6
lockfile 0.12.2
louis 3.20.0
macaroonbakery 1.3.1
Mako 1.1.3
Markdown 3.4.1
MarkupSafe 2.1.2
matplotlib 3.6.3
matplotlib-inline 0.1.6
mistune 2.0.4
monotonic 1.6
more-itertools 8.10.0
nbclassic 0.4.8
nbclient 0.7.2
nbconvert 7.2.8
nbformat 5.7.3
nest-asyncio 1.5.6
netifaces 0.11.0
networkx 2.5.1
notebook 6.5.2
notebook_shim 0.2.2
numpy 1.24.1
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
oauthlib 3.2.0
olefile 0.46
opencv-contrib-python 4.7.0.68
packaging 23.0
pandocfilters 1.5.0
paramiko 2.9.3
parso 0.8.3
pexpect 4.8.0
pickleshare 0.7.5
Pillow 9.0.1
pip 22.0.2
platformdirs 2.6.2
prometheus-client 0.15.0
prompt-toolkit 3.0.36
protobuf 4.22.0
psutil 5.9.4
ptyprocess 0.7.0
pure-eval 0.2.2
pyasn1 0.4.8
pyasn1-modules 0.2.8
pycairo 1.20.1
pycparser 2.21
pycups 2.0.1
Pygments 2.14.0
PyGObject 3.42.1
PyJWT 2.3.0
pymacaroons 0.13.0
PyNaCl 1.5.0
pyparsing 2.4.7
pyRFC3339 1.1
pyrsistent 0.19.3
python-apt 2.4.0
python-dateutil 2.8.2
python-debian 0.1.43ubuntu1
python-json-logger 2.0.4
pytz 2022.1
pyxdg 0.27
PyYAML 5.4.1
pyzmq 25.0.0
reportlab 3.6.8
requests 2.25.1
requests-oauthlib 1.3.1
rfc3339-validator 0.1.4
rfc3986-validator 0.1.1
rsa 4.9
scikit-learn 1.2.1
scipy 1.10.0
screen-resolution-extra 0.0.0
SecretStorage 3.3.1
Send2Trash 1.8.0
setuptools 59.6.0
six 1.16.0
sklearn 0.0.post1
sniffio 1.3.0
soupsieve 2.3.2.post1
ssh-import-id 5.11
stack-data 0.6.2
systemd-python 234
tensorboard 2.12.0
tensorboard-data-server 0.7.0
tensorboard-plugin-wit 1.8.1
terminado 0.17.1
threadpoolctl 3.1.0
timm 0.6.12
tinycss2 1.2.1
torch 1.13.1
torchaudio 0.13.1
torchvision 0.14.1
tornado 6.2
tqdm 4.64.1
traitlets 5.8.1
typing_extensions 4.4.0
ubuntu-advantage-tools 27.12
ubuntu-drivers-common 0.0.0
ufw 0.36.1
unattended-upgrades 0.1
uri-template 1.2.0
urllib3 1.26.5
usb-creator 0.3.7
wadllib 1.3.6
wcwidth 0.2.6
webcolors 1.12
webencodings 0.5.1
websocket-client 1.4.2
Werkzeug 2.2.3
wheel 0.37.1
xdg 5
xkit 0.0.0
zipp 1.0.0
Note that I have already explored solutions at the following links:
Kernel have died it will restart automatically
The kernel appears to have died. It will restart automatically
"The kernel appears to have died"- Segmentation fault
Kernel dies on loss.backward()