Segfault while running BERT

_chloe · June 16, 2022, 6:15pm

I’m using

Ubuntu 18.04 with GeForce 1080Ti
Docker 20.10.17
(docker image from ufoym/deepo)
Python 3.8.9
TensorFlow 2.8.0
Torch 1.12.0.dev20220327+cu113

and I’m trying to run the bert-base-uncased model. I can’t run it in parallel (BERT has a bug, so I would have to downgrade to 1.4 for parallel running), so it was running on 1 GPU.
It stopped after 1 epoch completed (after almost 20 hours of running), and I found out it was because of a gnome segfault error.

Segfault at 0 ip 00007fd1c1c14461 sp 00007ffee88ea458 error 4 in libc-2.27.so[7fd1c1a86000+1e7000]

Is there any reason why this happens, or any way to fix it?

ptrblck · June 20, 2022, 3:37am

Could you check the backtrace via gdb and post it here, please?
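
For reference, a common way to get a native backtrace is to start the script under gdb (for example `gdb --args python train.py`, where `train.py` is only a placeholder for the actual training script), type `run`, and type `bt` once the crash is hit. As a lighter-weight complement, Python's standard `faulthandler` module can print the Python-level traceback when the process receives SIGSEGV. The sketch below is only an illustration: the helper name `run_with_faulthandler` is made up, and the crash is triggered deliberately by reading address 0 through `ctypes`, not by BERT.

```python
import subprocess
import sys


def run_with_faulthandler(args):
    """Run `python -X faulthandler <args>` in a child process.

    Returns (returncode, stderr_text). With faulthandler enabled, a
    segfault in the child makes the interpreter dump the Python
    traceback to stderr before the process dies.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", *args],
        stderr=subprocess.PIPE,  # keep stderr so the dump can be inspected
        text=True,
    )
    return proc.returncode, proc.stderr


# Deliberate crash for demonstration only: read from address 0.
# In practice, pass the training script instead, e.g. ["train.py"]
# (hypothetical file name).
rc, err = run_with_faulthandler(["-c", "import ctypes; ctypes.string_at(0)"])
print("return code:", rc)
print(err)
```

The same effect is available without a wrapper by calling `faulthandler.enable()` at the top of the script or by setting the `PYTHONFAULTHANDLER=1` environment variable. This only shows which Python line was executing; the gdb backtrace is still needed to see the native frames inside libc.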