I am using PyTorch to train a sentence transformer model. Training freezes at random points after it starts, for no apparent reason. Sometimes it hits a segmentation fault, but most of the time the screen session just freezes without any clue. The code uses a single RTX 4090 GPU, which is only about half full, and memory consumption is only around 7 GB. I have been fighting this bug for more than a month and have tried many solutions, but none of them worked: recreating the conda env, downgrading CUDA, PyTorch, and torchvision, and installing another Linux distribution all made no difference. I would appreciate any help.
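Since the freeze leaves no clue, one way to get at least a Python-level trace is the standard-library `faulthandler` module. The following is only a minimal sketch, not part of my actual training script: the helper names `enable_hang_diagnostics` and `disable_hang_diagnostics` and the 600-second interval are assumptions chosen for illustration.

```python
import faulthandler
import sys


def enable_hang_diagnostics(timeout_s=600.0, log_file=sys.stderr):
    """Dump Python tracebacks on fatal signals and at a fixed interval.

    Hypothetical helper: the name and the default interval are illustrative.
    log_file must be a real file object (it needs a working fileno()).
    """
    # On SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL, write the traceback
    # of every thread to log_file before the process dies.
    faulthandler.enable(file=log_file, all_threads=True)
    # Every timeout_s seconds, write the traceback of every thread.
    # If consecutive dumps show the same stack, the process is likely
    # stuck at that call.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)


def disable_hang_diagnostics():
    """Stop the periodic dumps and remove the fatal-signal handlers."""
    faulthandler.cancel_dump_traceback_later()
    faulthandler.disable()
```

The idea would be to call `enable_hang_diagnostics()` once at the top of the training script, with `log_file` pointing at a file that survives the frozen screen session, and to call `disable_hang_diagnostics()` after training finishes. This only shows where the Python side is stuck; a hang inside a CUDA kernel or the driver would appear as a stack that stays parked on the same PyTorch call.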
My server configuration:
CPU: 24 cores
Memory: ~ 64 GB
GPU: 2x GeForce RTX 4090
Available disk: 922 GB
OS: Windows Server/WSL
CUDA Version: 11.8
nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.56.05    Driver Version: 522.25       CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  Off |
| 30%   51C    P2   179W / 450W |  12391MiB / 24564MiB |     40%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:08:00.0  On |                  Off |
|  0%   23C    P8     9W / 450W |    345MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A        23      G   /Xwayland                       N/A      |
|    1   N/A  N/A        23      G   /Xwayland                       N/A      |
+-----------------------------------------------------------------------------+
conda info:
active environment : search
active env location : /home/anaconda3/envs/search
shell level : 2
user config file : /home/rezam/.condarc
populated config files : /home/rezam/.condarc
conda version : 4.12.0
conda-build version : 3.21.8
python version : 3.9.12.final.0
virtual packages : __linux=5.15.79.1=0
                   __glibc=2.35=0
                   __unix=0=0
                   __archspec=1=x86_64
base environment : /home/anaconda3 (writable)
conda av data dir : /home/anaconda3/etc/conda
conda av metadata url : None
channel URLs : https://conda.anaconda.org/pytorch/linux-64
               https://conda.anaconda.org/pytorch/noarch
               https://repo.anaconda.com/pkgs/main/linux-64
               https://repo.anaconda.com/pkgs/main/noarch
               https://repo.anaconda.com/pkgs/r/linux-64
               https://repo.anaconda.com/pkgs/r/noarch
package cache : /home/anaconda3/pkgs
                /home/rezam/.conda/pkgs
envs directories : /home/anaconda3/envs
                   /home/rezam/.conda/envs
platform : linux-64
user-agent : conda/4.12.0 requests/2.27.1 CPython/3.9.12 Linux/5.15.79.1-microsoft-standard-WSL2 ubuntu/22.04.1 glibc/2.35
UID:GID : 1000:1000
netrc file : None
offline mode : False
conda list:
# packages in environment at /home/anaconda3/envs/search:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
anyio 3.6.2 pypi_0 pypi
bzip2 1.0.8 h7b6447c_0
ca-certificates 2023.01.10 h06a4308_0
certifi 2022.12.7 py310h06a4308_0
charset-normalizer 3.0.1 pypi_0 pypi
elastic-transport 8.4.0 pypi_0 pypi
elasticsearch 8.6.1 pypi_0 pypi
fastapi 0.89.1 pypi_0 pypi
filelock 3.9.0 pypi_0 pypi
hazm 0.7.1 pypi_0 pypi
huggingface-hub 0.12.0 pypi_0 pypi
idna 3.4 pypi_0 pypi
joblib 1.2.0 pypi_0 pypi
ld_impl_linux-64 2.38 h1181459_1
libffi 3.4.2 h6a678d5_6
libgcc-ng 11.2.0 h1234567_1
libgomp 11.2.0 h1234567_1
libstdcxx-ng 11.2.0 h1234567_1
libuuid 1.41.5 h5eee18b_0
libwapiti 0.2.1 pypi_0 pypi
ncurses 6.4 h6a678d5_0
nltk 3.4 pypi_0 pypi
numpy 1.24.1 pypi_0 pypi
nvidia-cublas-cu11 11.10.3.66 pypi_0 pypi
nvidia-cuda-nvrtc-cu11 11.7.99 pypi_0 pypi
nvidia-cuda-runtime-cu11 11.7.99 pypi_0 pypi
nvidia-cudnn-cu11 8.5.0.96 pypi_0 pypi
openssl 1.1.1s h7f8727e_0
packaging 23.0 pypi_0 pypi
pandas 1.5.3 pypi_0 pypi
pillow 9.4.0 pypi_0 pypi
pip 22.3.1 py310h06a4308_0
pyarrow 11.0.0 pypi_0 pypi
pydantic 1.10.4 pypi_0 pypi
python 3.10.9 h7a1cb2a_0
python-dateutil 2.8.2 pypi_0 pypi
pytz 2022.7.1 pypi_0 pypi
pyyaml 6.0 pypi_0 pypi
readline 8.2 h5eee18b_0
regex 2022.10.31 pypi_0 pypi
requests 2.28.2 pypi_0 pypi
scikit-learn 1.2.1 pypi_0 pypi
scipy 1.10.0 pypi_0 pypi
sentence-transformers 2.2.2 pypi_0 pypi
sentencepiece 0.1.97 pypi_0 pypi
setuptools 65.6.3 py310h06a4308_0
singledispatch 4.0.0 pypi_0 pypi
six 1.16.0 pypi_0 pypi
sniffio 1.3.0 pypi_0 pypi
sqlite 3.40.1 h5082296_0
starlette 0.22.0 pypi_0 pypi
threadpoolctl 3.1.0 pypi_0 pypi
tk 8.6.12 h1ccaba5_0
tokenizers 0.13.2 pypi_0 pypi
torch 1.13.1 pypi_0 pypi
torchvision 0.14.1 pypi_0 pypi
tqdm 4.64.1 pypi_0 pypi
transformers 4.26.0 pypi_0 pypi
typing-extensions 4.4.0 pypi_0 pypi
tzdata 2022g h04d1e81_0
urllib3 1.26.14 pypi_0 pypi
wheel 0.37.1 pyhd3eb1b0_0
xz 5.2.10 h5eee18b_1
zlib 1.2.13 h5eee18b_0
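
The listings above come from the shell. The versions that the Python process itself sees could be collected with something like the sketch below. The function name `collect_env_report` is hypothetical; the snippet uses only standard PyTorch attributes and falls back gracefully if `torch` cannot be imported.

```python
import platform
import sys


def collect_env_report():
    """Return a dict of the versions visible from inside the Python process.

    Hypothetical helper for illustration only.
    """
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    try:
        import torch
    except ImportError:
        # torch is not installed in this interpreter's environment.
        report["torch"] = None
        return report

    report["torch"] = torch.__version__
    # CUDA toolkit version the torch wheel was built against
    # (None for CPU-only builds); this can differ from the driver's version.
    report["torch_cuda_build"] = torch.version.cuda
    report["cudnn"] = torch.backends.cudnn.version()
    report["cuda_available"] = torch.cuda.is_available()
    if report["cuda_available"]:
        # (device name, compute capability) for every visible GPU.
        report["gpus"] = [
            (torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i))
            for i in range(torch.cuda.device_count())
        ]
    return report


if __name__ == "__main__":
    for key, value in collect_env_report().items():
        print(f"{key}: {value}")
```

PyTorch also ships a fuller report via `python -m torch.utils.collect_env`, which may be easier to paste into a bug report.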