How to solve a PyTorch freezing issue?

I am trying to train the tf_efficientnet_b0_ns model from the popular timm repository for the Kaggle G2Net gravitational wave detection challenge: G2Net Gravitational Wave Detection | Kaggle

During training I checked nvidia-smi and I was using around 19 GB of VRAM out of 24 GB (I am using an RTX 3090).
My problem is that during the 7th epoch the computer freezes and I am forced to restart the PC.
Here is the code and log: torch_freezing_pc - Pastebin.com
How can I troubleshoot this annoying freezing issue?
Thank you.

Once the code freezes you could attach gdb to the process and check the backtrace to see where the code is frozen.
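A Python-level alternative to gdb, as a minimal sketch: the standard-library faulthandler module can write the Python stack of every thread to a file when a single training step takes too long. The names train_with_watchdog, run_step, and hang.log are made up for illustration, and the timeout is an arbitrary assumption. If the whole machine locks up rather than just the process, nothing may get written.

```python
import faulthandler


def train_with_watchdog(steps, run_step, log_path, timeout_s=300.0):
    """Call run_step(i) for each step; dump all thread stacks to log_path
    if any single step runs longer than timeout_s seconds."""
    with open(log_path, "w") as log_file:
        try:
            for i in range(steps):
                # Re-arming replaces the previous timer, so a dump is only
                # written when one step exceeds timeout_s.
                faulthandler.dump_traceback_later(timeout_s, file=log_file)
                run_step(i)
        finally:
            # Stop the watchdog before the log file is closed.
            faulthandler.cancel_dump_traceback_later()


# Hypothetical usage inside a training script:
# train_with_watchdog(len(loader), lambda i: train_one_batch(i), "hang.log")
```

The dump shows Python frames only, so it can tell you which line of the training loop was running, but not what a native CUDA call was doing underneath; gdb is still the tool for that.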

@ptrblck How do I use gdb and get a backtrace? I have never used it, sorry. I checked the kernel log and found nothing problematic. Is there a memory leak in my code?
Or is it related to the PyTorch version?
I am using torch '1.10.0.dev20210623+cu111'.
I am using these packages (conda list output):

Name Version Build Channel

_libgcc_mutex 0.1 main
_openmp_mutex 4.5 1_gnu
addict 2.4.0 pypi_0 pypi
alabaster 0.7.12 py_0 conda-forge
albumentations 1.0.3 pypi_0 pypi
alsa-lib 1.2.3 h516909a_0 conda-forge
appdirs 1.4.4 pyh9f0ad1d_0 conda-forge
argh 0.26.2 pyh9f0ad1d_1002 conda-forge
astroid 2.6.0 py38h578d9bd_0 conda-forge
async_generator 1.10 py_0 conda-forge
asynctest 0.13.0 pypi_0 pypi
atomicwrites 1.4.0 pyh9f0ad1d_0 conda-forge
attrs 21.2.0 pyhd8ed1ab_0 conda-forge
autopep8 1.5.6 pyhd8ed1ab_0 conda-forge
babel 2.9.1 pyh44b312d_0 conda-forge
backcall 0.2.0 pyhd3eb1b0_0
black 21.5b2 pyhd8ed1ab_0 conda-forge
bleach 3.3.0 pyh44b312d_0 conda-forge
brotlipy 0.7.0 py38h497a2fe_1001 conda-forge
ca-certificates 2021.5.30 ha878542_0 conda-forge
certifi 2021.5.30 py38h578d9bd_0 conda-forge
cffi 1.14.5 py38ha65f79e_0 conda-forge
cfgv 3.3.0 pypi_0 pypi
chardet 4.0.0 py38h578d9bd_1 conda-forge
click 8.0.1 py38h578d9bd_0 conda-forge
cloudpickle 1.6.0 py_0 conda-forge
codecov 2.1.11 pypi_0 pypi
colorama 0.4.4 pyh9f0ad1d_0 conda-forge
configparser 5.0.2 pypi_0 pypi
coverage 5.5 pypi_0 pypi
cryptography 3.4.7 py38ha5dfef3_0 conda-forge
cudatoolkit 11.1.74 h6bb024c_0 nvidia
cycler 0.10.0 pypi_0 pypi
cython 0.29.23 pypi_0 pypi
dataclasses 0.8 pyhc8e2a94_1 conda-forge
dbus 1.13.6 h48d8840_2 conda-forge
decorator 4.4.2 pypi_0 pypi
defusedxml 0.7.1 pyhd8ed1ab_0 conda-forge
diff-match-patch 20200713 pyh9f0ad1d_0 conda-forge
dill 0.3.4 pypi_0 pypi
distlib 0.3.2 pypi_0 pypi
docker-pycreds 0.4.0 pypi_0 pypi
docutils 0.17.1 py38h578d9bd_0 conda-forge
entrypoints 0.3 pyhd8ed1ab_1003 conda-forge
expat 2.4.1 h9c3ff4c_0 conda-forge
filelock 3.0.12 pypi_0 pypi
flake8 3.9.2 pypi_0 pypi
fontconfig 2.13.1 hba837de_1005 conda-forge
freetype 2.10.4 h0708190_1 conda-forge
future 0.18.2 py38h578d9bd_3 conda-forge
gast 0.3.3 pypi_0 pypi
gettext 0.19.8.1 h0b5b191_1005 conda-forge
gitdb 4.0.7 pypi_0 pypi
gitpython 3.1.18 pypi_0 pypi
glib 2.68.3 h9c3ff4c_0 conda-forge
glib-tools 2.68.3 h9c3ff4c_0 conda-forge
googleapis-common-protos 1.53.0 pypi_0 pypi
grad-cam 1.3.1 pypi_0 pypi
grpcio 1.32.0 pypi_0 pypi
gst-plugins-base 1.18.4 hf529b03_2 conda-forge
gstreamer 1.18.4 h76c114f_2 conda-forge
h5py 2.10.0 pypi_0 pypi
helpdev 0.7.1 pyhd8ed1ab_0 conda-forge
icu 68.1 h58526e2_0 conda-forge
identify 2.2.10 pypi_0 pypi
idna 2.10 pyh9f0ad1d_0 conda-forge
imageio 2.9.0 pypi_0 pypi
imagesize 1.2.0 py_0 conda-forge
imgaug 0.4.0 pypi_0 pypi
importlib-metadata 4.5.0 py38h578d9bd_0 conda-forge
importlib-resources 5.2.2 pypi_0 pypi
importlib_metadata 4.5.0 hd8ed1ab_0 conda-forge
iniconfig 1.1.1 pypi_0 pypi
intervaltree 3.0.2 py_0 conda-forge
ipykernel 5.5.5 py38hd0cf306_0 conda-forge
ipython 7.18.1 py38h5ca1d4c_0 anaconda
ipython_genutils 0.2.0 pyhd3eb1b0_1
ipywidgets 7.6.3 pypi_0 pypi
isort 5.8.0 pypi_0 pypi
jedi 0.17.2 py38h578d9bd_1 conda-forge
jeepney 0.6.0 pyhd8ed1ab_0 conda-forge
jinja2 3.0.1 pyhd8ed1ab_0 conda-forge
jpeg 9d h36c2ea0_0 conda-forge
jsonschema 3.2.0 pyhd8ed1ab_3 conda-forge
jupyter_client 6.1.12 pyhd8ed1ab_0 conda-forge
jupyter_core 4.7.1 py38h578d9bd_0 conda-forge
jupyterlab-widgets 1.0.0 pypi_0 pypi
jupyterlab_pygments 0.1.2 pyh9f0ad1d_0 conda-forge
kaggle 1.5.12 pypi_0 pypi
kaggledatasets 0.0.1 pypi_0 pypi
keyring 23.0.1 py38h578d9bd_0 conda-forge
kiwisolver 1.3.1 pypi_0 pypi
krb5 1.19.1 hcc1bbae_0 conda-forge
kwarray 0.5.19 pypi_0 pypi
lanms-proper 1.0.1 pypi_0 pypi
lazy-object-proxy 1.6.0 py38h497a2fe_0 conda-forge
ld_impl_linux-64 2.35.1 h7274673_9
libclang 11.1.0 default_ha53f305_1 conda-forge
libedit 3.1.20191231 he28a2e2_2 conda-forge
libevent 2.1.10 hcdb4288_3 conda-forge
libffi 3.3 he6710b0_2
libgcc-ng 9.3.0 h5101ec6_17
libglib 2.68.3 h3e27bee_0 conda-forge
libgomp 9.3.0 h5101ec6_17
libiconv 1.16 h516909a_0 conda-forge
libllvm11 11.1.0 hf817b99_2 conda-forge
libogg 1.3.4 h7f98852_1 conda-forge
libopus 1.3.1 h7f98852_1 conda-forge
libpng 1.6.37 h21135ba_2 conda-forge
libpq 13.3 hd57d9b9_0 conda-forge
libsodium 1.0.18 h36c2ea0_1 conda-forge
libspatialindex 1.9.3 h9c3ff4c_3 conda-forge
libstdcxx-ng 9.3.0 hd4cf53a_17
libuuid 2.32.1 h7f98852_1000 conda-forge
libvorbis 1.3.7 h9c3ff4c_0 conda-forge
libxcb 1.13 h7f98852_1003 conda-forge
libxkbcommon 1.0.3 he3ba5ed_0 conda-forge
libxml2 2.9.12 h72842e0_0 conda-forge
llvmlite 0.36.0 pypi_0 pypi
lmdb 1.2.1 pypi_0 pypi
lz4-c 1.9.3 h9c3ff4c_0 conda-forge
markupsafe 2.0.1 py38h497a2fe_0 conda-forge
matplotlib 3.4.2 pypi_0 pypi
mccabe 0.6.1 pypi_0 pypi
mistune 0.8.4 py38h497a2fe_1003 conda-forge
mmcv-full 1.3.7 dev_0
mmdet 2.11.0 pypi_0 pypi
mmocr 0.2.0 dev_0
mmpycocotools 12.0.3 pypi_0 pypi
mypy_extensions 0.4.3 py38h578d9bd_3 conda-forge
mysql-common 8.0.25 ha770c72_2 conda-forge
mysql-libs 8.0.25 hfa10184_2 conda-forge
nbclient 0.5.3 pyhd8ed1ab_0 conda-forge
nbconvert 6.0.7 pypi_0 pypi
nbformat 5.1.3 pyhd8ed1ab_0 conda-forge
ncurses 6.2 he6710b0_1
nest-asyncio 1.5.1 pyhd8ed1ab_0 conda-forge
networkx 2.5.1 pypi_0 pypi
nnaudio 0.2.5 pypi_0 pypi
nodeenv 1.6.0 pypi_0 pypi
nspr 4.30 h9c3ff4c_0 conda-forge
nss 3.64 hb5efdd6_0 conda-forge
numba 0.53.1 pypi_0 pypi
numpy 1.19.5 pypi_0 pypi
numpydoc 1.1.0 py_1 conda-forge
oauthlib 3.1.1 pypi_0 pypi
opencv-python 4.5.2.54 pypi_0 pypi
opencv-python-headless 4.5.3.56 pypi_0 pypi
openssl 1.1.1k h7f98852_0 conda-forge
ordered-set 4.0.2 pypi_0 pypi
packaging 20.9 pyh44b312d_0 conda-forge
pandas 1.3.0 pypi_0 pypi
pandoc 2.14.0.3 h7f98852_0 conda-forge
pandocfilters 1.4.3 pypi_0 pypi
parso 0.7.0 pyh9f0ad1d_0 conda-forge
pathspec 0.8.1 pyhd3deb0d_0 conda-forge
pathtools 0.1.2 pypi_0 pypi
pcre 8.45 h9c3ff4c_0 conda-forge
pexpect 4.8.0 pyhd3eb1b0_3
pickleshare 0.7.5 pyhd3eb1b0_1003
pillow 8.2.0 pypi_0 pypi
pip 21.1.2 py38h06a4308_0
pluggy 0.13.1 py38h578d9bd_4 conda-forge
polygon3 3.0.9.1 pypi_0 pypi
pre-commit 2.13.0 pypi_0 pypi
promise 2.3 pypi_0 pypi
prompt-toolkit 3.0.18 pypi_0 pypi
protobuf 3.17.3 pypi_0 pypi
psutil 5.8.0 py38h497a2fe_1 conda-forge
pthread-stubs 0.4 h36c2ea0_1001 conda-forge
ptyprocess 0.7.0 pyhd3eb1b0_2
py 1.10.0 pypi_0 pypi
pyclipper 1.2.1 pypi_0 pypi
pycocotools 2.0.2 pypi_0 pypi
pycodestyle 2.7.0 pypi_0 pypi
pycparser 2.20 pyh9f0ad1d_2 conda-forge
pydocstyle 6.1.1 pyhd8ed1ab_0 conda-forge
pyflakes 2.3.1 pypi_0 pypi
pygments 2.9.0 pyhd3eb1b0_0
pylint 2.8.2 pyhd8ed1ab_0 conda-forge
pyls-black 0.4.6 pyh9f0ad1d_0 conda-forge
pyls-spyder 0.3.2 pyhd8ed1ab_0 conda-forge
pyopenssl 20.0.1 pyhd8ed1ab_0 conda-forge
pyparsing 2.4.7 pyh9f0ad1d_0 conda-forge
pyqt 5.12.3 py38h578d9bd_7 conda-forge
pyqt-impl 5.12.3 py38h7400c14_7 conda-forge
pyqt5-sip 4.19.18 py38h709712a_7 conda-forge
pyqtchart 5.12 py38h7400c14_7 conda-forge
pyqtwebengine 5.12.1 py38h7400c14_7 conda-forge
pyrsistent 0.17.3 py38h497a2fe_2 conda-forge
pysocks 1.7.1 py38h578d9bd_3 conda-forge
pytest 6.2.4 pypi_0 pypi
pytest-cov 2.12.1 pypi_0 pypi
pytest-runner 5.3.1 pypi_0 pypi
python 3.8.5 h7579374_1
python-dateutil 2.8.1 py_0 conda-forge
python-jsonrpc-server 0.4.0 pyh9f0ad1d_0 conda-forge
python-language-server 0.36.2 pyhd8ed1ab_0 conda-forge
python-slugify 5.0.2 pypi_0 pypi
python_abi 3.8 2_cp38 conda-forge
pytz 2021.1 pyhd8ed1ab_0 conda-forge
pywavelets 1.1.1 pypi_0 pypi
pyxdg 0.27 pyhd8ed1ab_0 conda-forge
pyyaml 5.4.1 pypi_0 pypi
pyzmq 22.1.0 py38h2035c66_0 conda-forge
qdarkstyle 2.8.1 pyhd8ed1ab_2 conda-forge
qt 5.12.9 hda022c4_4 conda-forge
qtawesome 1.0.3 pyhd8ed1ab_0 conda-forge
qtconsole 5.1.0 pyhd8ed1ab_0 conda-forge
qtpy 1.9.0 py_0 conda-forge
qudida 0.0.4 pypi_0 pypi
rapidfuzz 1.4.1 pypi_0 pypi
readline 8.1 h27cfd23_0
regex 2021.4.4 py38h497a2fe_0 conda-forge
requests 2.25.1 pyhd3deb0d_0 conda-forge
rope 0.19.0 pyhd8ed1ab_0 conda-forge
rtree 0.9.7 py38h02d302b_1 conda-forge
scikit-image 0.18.1 pypi_0 pypi
scipy 1.6.3 pypi_0 pypi
seaborn 0.11.2 pypi_0 pypi
secretstorage 3.3.1 py38h578d9bd_0 conda-forge
send2trash 1.5.0 pypi_0 pypi
sentry-sdk 1.3.1 pypi_0 pypi
setuptools 52.0.0 py38h06a4308_0
shapely 1.7.1 pypi_0 pypi
shortuuid 1.0.1 pypi_0 pypi
six 1.16.0 pyh6c4a22f_0 conda-forge
smmap 4.0.0 pypi_0 pypi
snowballstemmer 2.1.0 pyhd8ed1ab_0 conda-forge
sortedcontainers 2.4.0 pyhd8ed1ab_0 conda-forge
sphinx 4.0.2 pyh6c4a22f_1 conda-forge
sphinxcontrib-applehelp 1.0.2 py_0 conda-forge
sphinxcontrib-devhelp 1.0.2 py_0 conda-forge
sphinxcontrib-htmlhelp 2.0.0 pyhd8ed1ab_0 conda-forge
sphinxcontrib-jsmath 1.0.1 py_0 conda-forge
sphinxcontrib-qthelp 1.0.3 py_0 conda-forge
sphinxcontrib-serializinghtml 1.1.5 pyhd8ed1ab_0 conda-forge
spyder 4.2.5 py38h578d9bd_0 conda-forge
spyder-kernels 1.10.2 py38h578d9bd_0 conda-forge
sqlite 3.35.4 hdfb4753_0
subprocess32 3.5.4 pypi_0 pypi
tensorboard 2.6.0 pypi_0 pypi
tensorflow-datasets 4.4.0 pypi_0 pypi
tensorflow-gpu 2.4.0 pypi_0 pypi
tensorflow-metadata 1.2.0 pypi_0 pypi
terminado 0.10.0 pypi_0 pypi
terminaltables 3.1.0 pypi_0 pypi
testpath 0.5.0 pyhd8ed1ab_0 conda-forge
text-unidecode 1.3 pypi_0 pypi
textdistance 4.2.1 pyhd8ed1ab_0 conda-forge
three-merge 0.1.1 pyh9f0ad1d_0 conda-forge
tifffile 2021.6.6 pypi_0 pypi
timm 0.4.13 pypi_0 pypi
tk 8.6.10 hbc83047_0
toml 0.10.2 pyhd8ed1ab_0 conda-forge
torch 1.10.0.dev20210623+cu111 pypi_0 pypi
torchaudio 0.8.1 pypi_0 pypi
torchvision 0.11.0.dev20210623+cu111 pypi_0 pypi
tornado 6.1 py38h497a2fe_1 conda-forge
traitlets 5.0.5 pyhd3eb1b0_0
ttach 0.0.3 pypi_0 pypi
typed-ast 1.4.3 py38h497a2fe_0 conda-forge
typing_extensions 3.10.0.0 pyha770c72_0 conda-forge
ubelt 0.9.5 pypi_0 pypi
ujson 4.0.2 py38h709712a_0 conda-forge
urllib3 1.26.5 pyhd8ed1ab_0 conda-forge
virtualenv 20.4.7 pypi_0 pypi
wandb 0.12.1 pypi_0 pypi
watchdog 1.0.2 py38h578d9bd_1 conda-forge
wcwidth 0.2.5 py_0
webencodings 0.5.1 pypi_0 pypi
wheel 0.36.2 pyhd3eb1b0_0
widgetsnbextension 3.5.1 pypi_0 pypi
wrapt 1.12.1 py38h497a2fe_3 conda-forge
wurlitzer 2.1.0 py38h578d9bd_0 conda-forge
xdoctest 0.15.4 pypi_0 pypi
xorg-libxau 1.0.9 h7f98852_0 conda-forge
xorg-libxdmcp 1.1.3 h7f98852_0 conda-forge
xz 5.2.5 h7b6447c_0
yaml 0.2.5 h516909a_0 conda-forge
yapf 0.31.0 pyhd8ed1ab_0 conda-forge
zeromq 4.3.4 h9c3ff4c_0 conda-forge
zipp 3.4.1 pyhd8ed1ab_0 conda-forge
zlib 1.2.11 h7b6447c_3
zstd 1.5.0 ha95c52a_0 conda-forge

gdb attach pid should work as described here.
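A minimal sketch of what attaching could look like, assuming gdb is installed and the PID of the training process is known (for example from nvidia-smi or ps). The helper name gdb_attach_command and the PID 12345 are made up for illustration. The script only prints the command so it can be pasted into a terminal; depending on the system's ptrace settings it may need to be run with sudo.

```python
import shlex


def gdb_attach_command(pid):
    """Build a gdb command that attaches to pid, prints the backtrace of
    every thread, and then exits (batch mode)."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


# 12345 is a placeholder PID, not a real process.
print(shlex.join(gdb_attach_command(12345)))
```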

@ptrblck I track system metrics with wandb during training. Here are the results from just before the freeze:

  1. https://i.ibb.co/Jzs0J2C/1.png
  2. https://i.ibb.co/cttKxNk/2.png
  3. https://i.ibb.co/q5xGTsN/3.png
  4. https://i.ibb.co/hWyW5bx/4.png

The graphs show a constant usage and I don’t see any hang or how this could be used for debugging the issue. Were you able to check the backtrace via gdb?

Sorry, I don't understand how to use gdb for this. Is there a tutorial to follow? For example, I want to train this model: G2Net / efficientnet_b7 / baseline [training] | Kaggle
What is the step-by-step procedure for getting a gdb backtrace? I need a tutorial.

This tutorial might be a good starter.

So I have my training code saved in trainer.py, and to get a gdb backtrace I am planning to run this command from the terminal:
gdb -batch -ex "run" -ex "bt" --args python trainer.py 2>&1 | grep -v "^No stack." > log.txt

It should train my model using trainer.py and save the log in log.txt. Am I on track, or am I making a mistake? Is the command right for saving the backtrace log?
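For what it's worth, here is a rough Python sketch of the same idea, assuming gdb is on the PATH. It builds the same gdb invocation with plain ASCII arguments and copies its output into the log line by line, syncing each line to disk, because output that is still buffered when the machine hard-freezes is lost on restart. The helper names gdb_run_command and run_and_log are made up for illustration.

```python
import os
import subprocess
import sys


def gdb_run_command(script, python=sys.executable):
    """Build a gdb command that runs `python script` and prints a backtrace
    once the program stops (crash or normal exit)."""
    return ["gdb", "-batch", "-ex", "run", "-ex", "bt", "--args", python, script]


def run_and_log(cmd, log_path):
    """Stream stdout and stderr of cmd into log_path, forcing every line to
    disk so the log survives a hard reset. Returns the exit code."""
    with open(log_path, "w") as log_file:
        proc = subprocess.Popen(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
        )
        for line in proc.stdout:
            log_file.write(line)
            log_file.flush()
            os.fsync(log_file.fileno())
        return proc.wait()


# Hypothetical usage:
# run_and_log(gdb_run_command("trainer.py"), "log.txt")
```

Note that "bt" only prints something useful if the process stops inside gdb, for example on a crash. If the process simply hangs, gdb keeps waiting, and attaching from a second terminal is the way to get a backtrace.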