I’m using
- Ubuntu 18.04 with GeForce 1080Ti
- Docker 20.10.17 (docker image from ufoym/deepo)
- Python 3.8.9
- TensorFlow 2.8.0
- PyTorch 1.12.0.dev20220327+cu113
I’m trying to run the bert-base-uncased model. I can’t run it in parallel (BERT has a bug, so I would have to downgrade to 1.4 for parallel running), so it was running on a single GPU.
It stopped after the first epoch completed (after almost 20 hours of running), and I found out it was because of a gnome segfault error:
Segfault at 0 ip 00007fd1c1c14461 sp 00007ffee88ea458 error 4 in libc-2.27.so[7fd1c1a86000+1e7000]
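In case it helps with reading that log line, here is a minimal sketch that decodes it. It assumes the usual Linux x86 segfault message layout and the x86 page-fault error-code bits (bit 0: page present, bit 1: write access, bit 2: user mode). The function name `decode_segfault` and the returned field names are made up for illustration and do not come from any library.

```python
import re

# Layout of the kernel segfault message, as seen in the line above (assumed format).
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>\d+) in (?P<lib>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]",
    re.IGNORECASE,
)


def decode_segfault(line):
    """Hypothetical helper: split a kernel segfault line into readable fields."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a recognised segfault line")
    err = int(m.group("err"))
    return {
        "fault_address": int(m.group("addr"), 16),
        "library": m.group("lib"),
        # Instruction pointer minus the mapping base gives the offset inside the library.
        "offset_in_library": int(m.group("ip"), 16) - int(m.group("base"), 16),
        # x86 page-fault error-code bits (assumption: the machine is x86).
        "page_present": bool(err & 1),
        "access": "write" if err & 2 else "read",
        "mode": "user" if err & 4 else "kernel",
    }


LINE = ("Segfault at 0 ip 00007fd1c1c14461 sp 00007ffee88ea458 "
        "error 4 in libc-2.27.so[7fd1c1a86000+1e7000]")

info = decode_segfault(LINE)
for key, value in info.items():
    # Print addresses and offsets in hex, everything else as-is.
    print(key, hex(value) if isinstance(value, int) and not isinstance(value, bool) else value)
```

Read this way, the line describes a user-mode read from address 0 (a null pointer) inside libc. The offset into the library can be passed to a tool such as `addr2line` to find the function, provided debug symbols for that libc build are available.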
Is there a known reason for this, or any way to fix it?