Training freezes for no apparent reason

I am using PyTorch to train a sentence transformer model. Training freezes at random points after it starts, for no apparent reason. Sometimes it hits a segmentation fault, but most of the time the screen session just freezes without any clue. The code uses a single RTX 4090 GPU with only about half of it filled, and the memory consumed is only around 7 GB. I have been stuck on this bug for more than a month and have tried many solutions, but none of them worked: recreating the conda environment, downgrading CUDA, PyTorch, and torchvision, and installing a different Linux distribution. I would appreciate any help.

My server configuration:

CPU: 24 core
Memory: ~ 64 GB
GPU: 2x GeForce RTX 4090
Available disk: 922 GB
OS: Windows Server/WSL
CUDA Version: 11.8


| NVIDIA-SMI 520.56.05    Driver Version: 522.25       CUDA Version: 11.8     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  Off |
| 30%   51C    P2   179W / 450W |  12391MiB / 24564MiB |     40%      Default |
|                               |                      |                  N/A |
|   1  NVIDIA GeForce ...  On   | 00000000:08:00.0  On |                  Off |
|  0%   23C    P8     9W / 450W |    345MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |

| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|    0   N/A  N/A        23      G   /Xwayland                       N/A      |
|    1   N/A  N/A        23      G   /Xwayland                       N/A      |

conda info:

     active environment : search
    active env location : /home/anaconda3/envs/search
            shell level : 2
       user config file : /home/rezam/.condarc
 populated config files : /home/rezam/.condarc
          conda version : 4.12.0
    conda-build version : 3.21.8
         python version :
       virtual packages : __linux=
       base environment : /home/anaconda3  (writable)
      conda av data dir : /home/anaconda3/etc/conda
  conda av metadata url : None
           channel URLs :
          package cache : /home/anaconda3/pkgs
       envs directories : /home/anaconda3/envs
               platform : linux-64
             user-agent : conda/4.12.0 requests/2.27.1 CPython/3.9.12 Linux/ ubuntu/22.04.1 glibc/2.35
                UID:GID : 1000:1000
             netrc file : None
           offline mode : False

conda list:

How full is your regular RAM before it freezes?

And how are you loading your dataset?

Regular RAM: ~7 GB / 62 GB
GPU RAM: ~18 GB / 24 GB, and sometimes around 11 GB / 24 GB
When I increase the batch size, the freezing occurs much sooner and more frequently.
I use and class to load my dataset.

I see. It doesn’t look like a RAM issue. You’re using 2x 4090s with CUDA 11.8.

Perhaps post a bug report on the PyTorch GitHub page.
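If you do file a report, a stack trace from the frozen process would probably help. Below is a minimal sketch using Python's standard `faulthandler` module. It is not something from the original script: `train()` is a hypothetical placeholder for the real training loop, and the 600-second interval is an arbitrary assumption. It prints a Python traceback on a fatal signal such as a segmentation fault, and also dumps all thread stacks periodically, so the last dump before you kill a frozen run shows roughly where it was stuck.

```python
import faulthandler
import sys

# Print the Python traceback of all threads if the process receives a
# fatal signal (for example SIGSEGV from a segmentation fault).
faulthandler.enable(file=sys.stderr, all_threads=True)

# Assumed interval: dump all thread stacks every 600 seconds. This is a
# periodic dump, not a hang detector; if the run is frozen, successive
# dumps should keep showing the same stack.
DUMP_INTERVAL_S = 600
faulthandler.dump_traceback_later(DUMP_INTERVAL_S, repeat=True, file=sys.stderr)


def train():
    """Hypothetical placeholder for the actual training loop."""
    pass


try:
    train()
finally:
    # Stop the periodic dumps once training has finished.
    faulthandler.cancel_dump_traceback_later()
```

Redirecting stderr to a file (for example `python train.py 2> trace.log`, where `train.py` is an assumed script name) keeps the dumps even if the screen session itself becomes unresponsive.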

I have a similar problem with my single RTX 4090 setup, and it happens with the different models I am trying to train. Have you tried setting num_workers=0? I’m willing to bet the freezes will go away. I’ve narrowed the issue down to file access within my dataset as the one thing that causes it, and it looks like the dataset workers are deadlocked. I’m not sure whether it’s a Docker, WSL, or PyTorch issue, but nothing else I tried helped (and it’s severely limiting the speed at which I can train).
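For reference, here is a minimal sketch of what loading with no worker processes looks like. The random tensors and the batch size are stand-ins I made up for illustration, not the original poster's dataset; the only relevant part is the `num_workers=0` argument to `DataLoader`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in data: 64 samples with 8 features each.
features = torch.randn(64, 8)
labels = torch.randint(0, 2, (64,))
dataset = TensorDataset(features, labels)

# num_workers=0 loads every batch in the main process, so there are no
# worker subprocesses that could deadlock on file access.
loader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=0)

num_batches = 0
for batch_features, batch_labels in loader:
    # The real training step would go here.
    num_batches += 1
```

If the freezes stop with this setting, that would point toward the worker processes rather than the GPU side, at the cost of slower data loading.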