Training freezes for no apparent reason

I am using PyTorch to train a sentence-transformer model. The training randomly freezes some time after it starts, for no apparent reason. Sometimes it hits a segmentation fault, but most of the time the screen session just freezes without any clue. The code uses a single RTX 4090 GPU, which is only about half full, and memory consumption is only around 7 GB. I have been stuck on this bug for more than a month and have tried many solutions, but none of them worked: recreating the conda env, downgrading CUDA, PyTorch, and torchvision, and installing another Linux distribution. I would appreciate any help.
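
When a run freezes or segfaults without any output, Python's built-in faulthandler module can sometimes provide a clue. The sketch below is only an assumption about how it could be wired into a training script; the helper names and the 600-second interval are hypothetical and not part of the original code. It prints a Python traceback on a segmentation fault and also dumps the stack of every thread at a fixed interval, so the last dump before a freeze shows where each thread was stuck.

```python
import faulthandler
import sys


def enable_hang_diagnostics(timeout_s=600, log_file=sys.stderr):
    """Hypothetical helper: call once at the top of the training script."""
    # Print a Python traceback for all threads on SIGSEGV, SIGFPE, SIGABRT, etc.
    faulthandler.enable(file=log_file, all_threads=True)
    # Dump the tracebacks of all threads every `timeout_s` seconds.
    # With repeat=True this fires periodically, whether or not the run hangs.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)


def disable_hang_diagnostics():
    """Hypothetical helper: call after training finishes normally."""
    faulthandler.cancel_dump_traceback_later()
    faulthandler.disable()
```

If successive dumps keep showing the same line (for example inside the data loading code), that is a hint about where the freeze happens. The log file must be a real file object that stays open for as long as the diagnostics are enabled.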

My server configuration:

CPU: 24 cores
Memory: ~ 64 GB
GPU: 2x GeForce RTX 4090
Available disk: 922 GB
OS: Windows Server/WSL
CUDA Version: 11.8

nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.56.05    Driver Version: 522.25       CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  Off |
| 30%   51C    P2   179W / 450W |  12391MiB / 24564MiB |     40%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:08:00.0  On |                  Off |
|  0%   23C    P8     9W / 450W |    345MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A        23      G   /Xwayland                       N/A      |
|    1   N/A  N/A        23      G   /Xwayland                       N/A      |
+-----------------------------------------------------------------------------+

conda info:

     active environment : search
    active env location : /home/anaconda3/envs/search
            shell level : 2
       user config file : /home/rezam/.condarc
 populated config files : /home/rezam/.condarc
          conda version : 4.12.0
    conda-build version : 3.21.8
         python version : 3.9.12.final.0
       virtual packages : __linux=5.15.79.1=0
                          __glibc=2.35=0
                          __unix=0=0
                          __archspec=1=x86_64
       base environment : /home/anaconda3  (writable)
      conda av data dir : /home/anaconda3/etc/conda
  conda av metadata url : None
           channel URLs : https://conda.anaconda.org/pytorch/linux-64
                          https://conda.anaconda.org/pytorch/noarch
                          https://repo.anaconda.com/pkgs/main/linux-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/linux-64
                          https://repo.anaconda.com/pkgs/r/noarch
          package cache : /home/anaconda3/pkgs
                          /home/rezam/.conda/pkgs
       envs directories : /home/anaconda3/envs
                          /home/rezam/.conda/envs
               platform : linux-64
             user-agent : conda/4.12.0 requests/2.27.1 CPython/3.9.12 Linux/5.15.79.1-microsoft-standard-WSL2 ubuntu/22.04.1 glibc/2.35
                UID:GID : 1000:1000
             netrc file : None
           offline mode : False

conda list:

# packages in environment at /home/anaconda3/envs/search:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main
_openmp_mutex             5.1                       1_gnu
anyio                     3.6.2                    pypi_0    pypi
bzip2                     1.0.8                h7b6447c_0
ca-certificates           2023.01.10           h06a4308_0
certifi                   2022.12.7       py310h06a4308_0
charset-normalizer        3.0.1                    pypi_0    pypi
elastic-transport         8.4.0                    pypi_0    pypi
elasticsearch             8.6.1                    pypi_0    pypi
fastapi                   0.89.1                   pypi_0    pypi
filelock                  3.9.0                    pypi_0    pypi
hazm                      0.7.1                    pypi_0    pypi
huggingface-hub           0.12.0                   pypi_0    pypi
idna                      3.4                      pypi_0    pypi
joblib                    1.2.0                    pypi_0    pypi
ld_impl_linux-64          2.38                 h1181459_1
libffi                    3.4.2                h6a678d5_6
libgcc-ng                 11.2.0               h1234567_1
libgomp                   11.2.0               h1234567_1
libstdcxx-ng              11.2.0               h1234567_1
libuuid                   1.41.5               h5eee18b_0
libwapiti                 0.2.1                    pypi_0    pypi
ncurses                   6.4                  h6a678d5_0
nltk                      3.4                      pypi_0    pypi
numpy                     1.24.1                   pypi_0    pypi
nvidia-cublas-cu11        11.10.3.66               pypi_0    pypi
nvidia-cuda-nvrtc-cu11    11.7.99                  pypi_0    pypi
nvidia-cuda-runtime-cu11  11.7.99                  pypi_0    pypi
nvidia-cudnn-cu11         8.5.0.96                 pypi_0    pypi
openssl                   1.1.1s               h7f8727e_0
packaging                 23.0                     pypi_0    pypi
pandas                    1.5.3                    pypi_0    pypi
pillow                    9.4.0                    pypi_0    pypi
pip                       22.3.1          py310h06a4308_0
pyarrow                   11.0.0                   pypi_0    pypi
pydantic                  1.10.4                   pypi_0    pypi
python                    3.10.9               h7a1cb2a_0
python-dateutil           2.8.2                    pypi_0    pypi
pytz                      2022.7.1                 pypi_0    pypi
pyyaml                    6.0                      pypi_0    pypi
readline                  8.2                  h5eee18b_0
regex                     2022.10.31               pypi_0    pypi
requests                  2.28.2                   pypi_0    pypi
scikit-learn              1.2.1                    pypi_0    pypi
scipy                     1.10.0                   pypi_0    pypi
sentence-transformers     2.2.2                    pypi_0    pypi
sentencepiece             0.1.97                   pypi_0    pypi
setuptools                65.6.3          py310h06a4308_0
singledispatch            4.0.0                    pypi_0    pypi
six                       1.16.0                   pypi_0    pypi
sniffio                   1.3.0                    pypi_0    pypi
sqlite                    3.40.1               h5082296_0
starlette                 0.22.0                   pypi_0    pypi
threadpoolctl             3.1.0                    pypi_0    pypi
tk                        8.6.12               h1ccaba5_0
tokenizers                0.13.2                   pypi_0    pypi
torch                     1.13.1                   pypi_0    pypi
torchvision               0.14.1                   pypi_0    pypi
tqdm                      4.64.1                   pypi_0    pypi
transformers              4.26.0                   pypi_0    pypi
typing-extensions         4.4.0                    pypi_0    pypi
tzdata                    2022g                h04d1e81_0
urllib3                   1.26.14                  pypi_0    pypi
wheel                     0.37.1             pyhd3eb1b0_0
xz                        5.2.10               h5eee18b_1
zlib                      1.2.13               h5eee18b_0

How full is your regular RAM before it freezes?

And how are you loading your dataset?

Regular RAM: ~7 GB / 62 GB
GPU RAM: ~18 GB / 24 GB, sometimes around 11 GB / 24 GB
When I increase the batch size, the freezing occurs much sooner and more frequently.
I use the torch.utils.data.Dataset and torch.utils.data.DataLoader classes to load my dataset.

I see. It doesn’t look like a RAM issue. You’re using 2x 4090s with CUDA 11.8.

Perhaps post a bug report on the PyTorch GitHub page.

I have a similar problem with my single RTX 4090 setup, and it happens with the different models I am trying to train. Have you tried setting num_workers=0? I’m willing to bet the freezes will go away. I’ve narrowed the issue down to file access within my dataset being the one thing that causes it, and it looks like the dataset workers are deadlocked. I’m not sure if it’s a Docker, WSL, or PyTorch issue, but nothing else I tried helped (and it’s severely limiting the speed at which I can train).
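
To make the num_workers=0 test concrete, here is a minimal sketch. ToyPairs and make_loader are hypothetical stand-ins, not the original poster's dataset or training code. With num_workers=0 every batch is loaded in the main process, so there are no worker subprocesses that could deadlock. If workers are re-enabled later, a positive timeout makes a stuck worker raise an error instead of hanging silently.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyPairs(Dataset):
    """Hypothetical stand-in for the real dataset: returns (index, tensor) pairs."""

    def __init__(self, n_items=32, dim=8):
        self.data = torch.arange(n_items * dim, dtype=torch.float32).reshape(n_items, dim)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # In a real dataset, any file access would happen here.
        return idx, self.data[idx]


def make_loader(dataset, batch_size=8, num_workers=0, worker_timeout_s=60):
    # The timeout only applies to worker processes and must be 0
    # when num_workers is 0 (single-process loading).
    timeout = worker_timeout_s if num_workers > 0 else 0
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=False,
        num_workers=num_workers,
        timeout=timeout,
    )


if __name__ == "__main__":
    loader = make_loader(ToyPairs(), num_workers=0)
    for indices, batch in loader:
        print(indices.tolist(), tuple(batch.shape))
```

If the freezes stop with num_workers=0, that points at the worker processes rather than the GPU or the model. The cost is slower data loading, since nothing is prefetched in parallel.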