Segfault while running BERT

I’m using

  • Ubuntu 18.04 with GeForce 1080Ti
  • Docker 20.10.17
    (docker image from ufoym/deepo)
  • Python 3.8.9
  • TensorFlow 2.8.0
  • Torch 1.12.0.dev20220327+cu113

and I'm trying to run the bert-base-uncased model. I cannot run it in parallel (BERT has a bug, so I would have to downgrade to 1.4 for parallel running), so it was running on one GPU.
It stopped after one epoch completed (after almost 20 hours of running), and I found out it was because of a gnome segfault error:

Segfault at 0 ip 00007fd1c1c14461 sp 00007ffee88ea458 error 4 in libc-2.27.so[7fd1c1a86000+1e7000]

Is there a known reason for this, or any way to fix it?

Could you check the backtrace via gdb and post it here, please?
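
If attaching gdb inside the container turns out to be awkward, a rough sketch using Python's built-in faulthandler module might also help narrow things down, since it dumps the Python-level traceback when the process receives a fatal signal such as SIGSEGV. The names enable_crash_tracebacks and train below are hypothetical placeholders, not part of your script, and this complements the gdb backtrace rather than replacing it:

```python
import faulthandler
import sys


def enable_crash_tracebacks(stream=None):
    """Dump Python tracebacks for all threads on fatal signals (e.g. SIGSEGV)."""
    # Assumption: the stream is a real file with a file descriptor (stderr by default).
    target = stream if stream is not None else sys.stderr
    faulthandler.enable(file=target, all_threads=True)
    return faulthandler.is_enabled()


def train():
    # Hypothetical placeholder for the actual BERT training loop.
    pass


if __name__ == "__main__":
    enable_crash_tracebacks()
    train()
```

The same handler can be switched on without touching the code by starting the interpreter with `python -X faulthandler` or by setting the `PYTHONFAULTHANDLER=1` environment variable. The output only shows which Python line was executing at the time of the crash, so the native frames from gdb would still be useful.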