On a single machine, I have started getting random segmentation faults that cause a core dump. I have seen segmentation faults in other scenarios that were caused by invalid array indexing in the C++ backend, but I don't think that is the cause here, because the crashes happen at random points while models are training stably. I have also seen it happen the same way while training many different models, so I am confident it is not specific to one implementation. Here is an example of the output I am seeing:
d=1.01e+7, loss=2.15e+10]
17%|██████████████████▌
| 8384/50000 [03:29<17:17, 40.11it/s, bpd=9.33, loss=1.99e+4]Segmentation fault (core dumped)
That is a tqdm progress bar followed immediately by the core dump, with no error message or Python traceback. I am not sure how to diagnose what is happening here. Could this be a hardware issue?
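A crash with no Python traceback usually means the fault happened in native code (libtorch, CUDA, or a DataLoader worker). As a low-effort first step, one could enable Python's built-in `faulthandler` at the top of the training script so the next crash at least prints the Python-level stack that was active. This is a general sketch, not specific to your setup; the log file name is an arbitrary choice:

```python
import faulthandler

# Print the Python traceback of all threads to stderr when the process
# receives a fatal signal such as SIGSEGV -- this shows which Python
# call (e.g. a DataLoader worker or a backward pass) was active.
faulthandler.enable()

# Optional extra step: periodically dump all thread tracebacks to a
# file, so a hard crash still leaves a recent snapshot on disk.
# "faulthandler.log" is an arbitrary file name for this sketch.
log_file = open("faulthandler.log", "w")
faulthandler.dump_traceback_later(timeout=60, repeat=True, file=log_file)

print("faulthandler enabled:", faulthandler.is_enabled())
# → faulthandler enabled: True
```

If the crash comes from a DataLoader worker, setting `num_workers=0` temporarily can also tell you whether the fault is in the data pipeline or in the model itself.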
Versions:
PyTorch: 1.6.0
OS: Ubuntu 20.04
GPU: GeForce GTX 1080 Ti 11GB
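Since the message says `(core dumped)`, it is also worth confirming that a core file is actually being written, so it can be opened in gdb to see the native stack. A rough sketch (the exact defaults depend on the distro, so treat the paths below as assumptions to verify):

```shell
# Allow core dumps in this shell session (the soft limit often defaults to 0).
ulimit -c unlimited

# Show where the kernel writes core files; on Ubuntu 20.04 this is often
# piped to apport rather than left in the working directory.
cat /proc/sys/kernel/core_pattern
```

Once a core file exists, `gdb python core` followed by `bt` prints the native backtrace of the crash. To check the hardware-issue theory, `dmesg` output around the crash time (NVIDIA driver "Xid" errors point at the GPU) and a RAM test such as memtest86+ would help rule the machine in or out.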