Sporadic crashes/Segmentation Faults

Hi all,
Lately I’ve been running into sporadic segmentation faults/crashes while training. (Sometimes there’s a segmentation fault message in dmesg and the training hangs; other times the whole system crashes.) This was not an issue until recently: we had completed training with the same code and dataset many times before this started occurring. At first I thought it was related to package updates, but I’m still seeing the issue in a clean environment with PyTorch 1.5.1, the nightly build, and 1.2.0.

System info:

  • Ubuntu 18.04
  • GeForce RTX 2080

I’ve scoured this forum + pytorch issues and tried the following recommended approaches:

  • setting num_workers to 0
  • setting pin_memory to false
  • Moving the training from using 2 GPUs to 1
  • working from a clean environment
  • Turning off data shuffling to see if it’s data related
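For anyone reproducing these steps, the data-loading settings above can be collected in one place. This is only a sketch: `DEBUG_LOADER_KWARGS`, `make_debug_loader`, and the batch size are hypothetical names and values, not part of my training code.

```python
# Debug settings from the list above, as keyword arguments for
# torch.utils.data.DataLoader.
DEBUG_LOADER_KWARGS = {
    "num_workers": 0,     # load data in the main process, no worker subprocesses
    "pin_memory": False,  # do not use page-locked host memory
    "shuffle": False,     # fixed sample order, so a bad sample fails at the same step
}


def make_debug_loader(dataset, batch_size=32):
    """Build a DataLoader with the debug settings (batch_size is a placeholder)."""
    # Imported lazily so the settings can be inspected without PyTorch installed.
    from torch.utils.data import DataLoader

    return DataLoader(dataset, batch_size=batch_size, **DEBUG_LOADER_KWARGS)
```

Restricting the run to a single GPU can be done separately, for example by setting CUDA_VISIBLE_DEVICES=0 before launching the script.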

I was looking to see if anyone had any other recommendations for approaches to try.

Thank you for your time!

If you encounter system crashes suddenly without changing anything in the environment and source code, I would also consider some hardware issues, e.g. overheating or a weak PSU (or a faulty one).

Do you see any errors, or does your terminal just print the seg fault message and stop?
Could you check the temperature sensors and see how hot the system is before it crashes?
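One way to capture the GPU temperature right up to a crash is to log it to a file from a second process. A minimal sketch, assuming nvidia-smi is on the PATH; the function names, log path, and interval are placeholders:

```python
import os
import subprocess
import time
from typing import List


def parse_gpu_temps(output: str) -> List[int]:
    """Parse nvidia-smi output with one temperature (in C) per line, one line per GPU."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def read_gpu_temps() -> List[int]:
    """Query the current temperature of every GPU via nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_temps(result.stdout)


def log_gpu_temps(path: str = "gpu_temps.log", interval_s: float = 5.0) -> None:
    """Append a timestamped reading every interval_s seconds until interrupted."""
    with open(path, "a") as log_file:
        while True:
            log_file.write(f"{time.strftime('%H:%M:%S')} {read_gpu_temps()}\n")
            # Flush and sync so the last reading survives a hard crash.
            log_file.flush()
            os.fsync(log_file.fileno())
            time.sleep(interval_s)
```

Calling log_gpu_temps() in a separate terminal while training runs leaves the last readings on disk even if the machine goes down. CPU and motherboard sensors would need a different tool.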

Thanks for taking the time to respond @ptrblck!

There are never any messages printed to the terminal by the running script, only occasionally in the system logs (if the system doesn’t crash).

Some of the messages are as follows:

python[24663]: segfault at 7f0e9cd687b0 ip 00007f0e9cd687b0 sp 00007ffcf0727958 error 15
traps: python[11772] general protection ip:555555754fd6 sp:7fffffff40d0 error:0 in python3.7[5555555af000+1ce000]
[17326.371274] BUG: Bad page state in process python  pfn:e77b24
[17326.371278] page:ffffdd7539dec900 count:0 mapcount:-4096 mapping:0000000000000000 index:0x1
[17326.371279] flags: 0x17ffffc0000000()
[17326.371281] raw: 0017ffffc0000000 0000000000000000 0000000000000001 00000000ffffefff
[17326.371282] raw: dead000000000100 dead000000000200 0000000000000000 0000000000000000
[17326.371283] page dumped because: nonzero mapcount
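Since the script itself prints nothing, Python's built-in faulthandler module may help show which Python call was active when the segfault hit. A minimal sketch; the helper name and log path are placeholders, and it will not capture anything if the whole machine goes down:

```python
import faulthandler


def enable_crash_traceback(path="segfault_traceback.log"):
    """Write the Python traceback of all threads to `path` on SIGSEGV and similar signals."""
    log_file = open(path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    # The caller must keep this file object alive for as long as the handler is enabled.
    return log_file
```

Calling enable_crash_traceback() at the top of the training script (or running it with python -X faulthandler, which writes to stderr instead) is enough to turn this on.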

nvidia-smi didn’t show GPU temps as being out of bounds.

I will investigate the memory and PSU this upcoming week.

Thank you

The error 15 indicates an attempt to execute code from a mapped memory area that isn’t executable, while the general protection trap points to “extremely wild pointers”.
Searching for these kinds of errors often points to an underlying kernel bug.
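For reference, the error value in a segfault line is the x86 page-fault error code, which the kernel prints in hexadecimal, so error 15 is 0x15. A minimal Python sketch of decoding it, with the bit meanings taken from the x86 page-fault error code layout; the function name is only illustrative:

```python
from typing import List

# (mask, meaning if the bit is set, meaning if it is clear or None)
PAGE_FAULT_BITS = [
    (0x01, "protection violation (page is mapped)", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in a page-table entry", None),
    (0x10, "instruction fetch", None),
]


def decode_segfault_error(code_hex: str) -> List[str]:
    """Decode the hex error code from a dmesg 'segfault at ... error N' line."""
    code = int(code_hex, 16)
    meanings = []
    for mask, if_set, if_clear in PAGE_FAULT_BITS:
        if code & mask:
            meanings.append(if_set)
        elif if_clear is not None:
            meanings.append(if_clear)
    return meanings


for meaning in decode_segfault_error("15"):
    print(meaning)
```

For 0x15 the set bits are protection violation, user mode, and instruction fetch, which matches a user-space process trying to execute from a mapped but non-executable page.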

Which OS and kernel version are you using? Did you update recently?

Ubuntu 18.04.4 LTS

Minor versions of the kernel have been updated (4.15.0-106 => 4.15.0-109)

Could you use a docker container with Ubuntu 16.04 LTS just for the sake of debugging?
This could potentially isolate the error to the OS.

Thank you. Will attempt tomorrow

@ptrblck I ran everything successfully for a partial training run on the default PyTorch Docker image (Ubuntu 18.04). I’m going to run the full training tonight and hopefully it continues to work. I’m still unsure what caused the issue, but a clean environment seems to help.

Thank you for all of your help!
