Random core dumps and segmentation faults

On a single machine I started getting random segmentation faults that cause a core dump. I have seen segmentation faults in other scenarios that were caused by invalid array indexing in the C++ backend. I don’t think that is the cause here, though, because it happens at random points while training models that are otherwise training stably. I have seen it happen the same way while training many different models, so I am sure it is not tied to a specific implementation. Here is an example of the output I am seeing:

d=1.01e+7, loss=2.15e+10]
| 8384/50000 [03:29<17:17, 40.11it/s, bpd=9.33, loss=1.99e+4]Segmentation fault (core dumped)

That is a tqdm progress bar followed directly by the core dump message, with no other error message. I am not sure how to diagnose what is happening here. Could this be a hardware issue?


PyTorch: 1.6.0
OS: Ubuntu 20.04
GPU: GeForce GTX 1080 Ti 11GB

It might be a hardware issue, but I would start by trying to isolate a particular software setup that might be causing it.
E.g. could you run the same workload in an Ubuntu 18.04 Docker container and check whether the behavior stays the same?
If so, could you run it with:

gdb --args python script.py args

and post the backtrace here? Inside gdb, type run to start the script and, once it crashes, type bt to print the backtrace.
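
If attaching gdb is awkward, a lighter complement (not a replacement for the native backtrace) might be Python's built-in faulthandler module, which prints the Python-level stack of each thread when the process receives SIGSEGV. Below is a minimal sketch; the helper name enable_crash_tracebacks and its log_path argument are made up for illustration and are not part of PyTorch or of your script.

```python
import faulthandler
import sys


def enable_crash_tracebacks(log_path=None):
    """Enable faulthandler so a fatal signal (e.g. SIGSEGV) dumps Python tracebacks.

    log_path is a hypothetical optional argument: when given, the tracebacks
    are written to that file instead of stderr. The returned stream must stay
    open for as long as faulthandler is enabled.
    """
    stream = open(log_path, "w") if log_path else sys.stderr
    # all_threads=True dumps every Python thread, not only the crashing one.
    faulthandler.enable(file=stream, all_threads=True)
    return stream


# Assumed usage: call this once at the top of the training script, before the
# training loop starts, e.g.
#     crash_stream = enable_crash_tracebacks("crash_traceback.log")
```

The same effect should be available without editing the script by running python -X faulthandler script.py args. This only shows which Python line was executing when the crash happened; the gdb backtrace is still needed to see where it crashed in native code.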