Hi all,
Lately I’ve been running into sporadic segmentation faults/crashes while training (sometimes there’s a segmentation fault message in dmesg and the training hangs; other times the whole system crashes). This wasn’t an issue until recently: we had completed training with the same code and dataset many times before this started occurring. At first I thought it was potentially related to package updates, but I’m still seeing the issue from a clean environment with PyTorch 1.5.1, the nightly build, and 1.2.0.
System info:
Ubuntu 18.04
GeForce RTX 2080
I’ve scoured this forum and the PyTorch GitHub issues and tried the following recommended approaches:
Setting num_workers to 0
Setting pin_memory to False
Moving the training from 2 GPUs to 1
Working from a clean environment
Turning off data shuffling to see if it’s data related (a quick sketch of these settings follows below)
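For concreteness, this is roughly what those changes look like in my training script (the dataset and model here are dummy placeholders for illustration, not my actual code):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-ins for the real dataset and model (placeholders only)
train_dataset = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,)))
model = torch.nn.Linear(3 * 32 * 32, 10)

# Settings changed while debugging the crashes
train_loader = DataLoader(
    train_dataset,
    batch_size=8,
    num_workers=0,     # no worker processes, rules out worker-related segfaults
    pin_memory=False,  # skip pinned (page-locked) host memory
    shuffle=False,     # fixed sample order, to check if it's data related
)

# Single GPU instead of spreading the training across 2 GPUs
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model.to(device)

for data, target in train_loader:
    data, target = data.to(device), target.to(device)
    output = model(data.flatten(1))
```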
I was looking to see if anyone had any other reccomendations for approaches to try.
If you suddenly encounter system crashes without changing anything in the environment or source code, I would also consider hardware issues, e.g. overheating or a weak (or faulty) PSU.
Do you see any errors, or does your terminal just print the seg fault message and stop?
Could you check the temp. sensors and see how hot the system is before it crashes?
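Something like this could log the temperature once per second in a second terminal while training, so the last readings before a crash would show whether the GPU is overheating (just a sketch shelling out to nvidia-smi, which ships with the NVIDIA driver):

```python
import subprocess
import time

# Poll the GPU temperature via nvidia-smi and print it with a timestamp
while True:
    temp = subprocess.run(
        ['nvidia-smi', '--query-gpu=temperature.gpu', '--format=csv,noheader'],
        capture_output=True, text=True,
    ).stdout.strip()
    print(f'{time.strftime("%H:%M:%S")}  GPU temperature: {temp} C')
    time.sleep(1)
```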
@ptrblck I ran everything successfully for a partial training run on the default PyTorch Docker image (Ubuntu 18.04). Going to run the full training tonight and hopefully it continues to work. Still unsure what caused the issue, but a clean environment seems to help.
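In case it helps anyone hitting the same thing, a quick sanity check I ran inside the container to confirm what the clean environment actually provides (standard torch introspection calls):

```python
import torch

# Confirm the versions the container actually provides
print(torch.__version__)               # PyTorch version
print(torch.version.cuda)              # CUDA version PyTorch was built against
print(torch.backends.cudnn.version())  # cuDNN version
print(torch.cuda.get_device_name(0))   # should report the GeForce RTX 2080
```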