Dealing with "CUDA error: uncorrectable ECC error encountered"

I’ve recently started seeing errors like the following:

019-09-15 13:52:45,866 - log.py:60 - Failed
Traceback (most recent call last):
  File "train.py", line 770, in <module>
    main_loop()
  File "train.py", line 649, in main_loop
    util.record('params', torch.sum(util.flat_param(model)).item())
RuntimeError: CUDA error: uncorrectable ECC error encountered

Is it something that I could be causing, or should I blame it on cosmic rays?

How often do you see these errors?

I had two such crashes in one day but have not seen the error again after rerunning a few times, so it seems hardware-related.
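
One way to check the hardware theory is to look at the GPU's own ECC counters. Below is a rough sketch, assuming nvidia-smi is on the PATH and the card supports ECC. The helper names (parse_ecc_csv, read_ecc_counts) are made up for illustration; the query fields are the ones listed by `nvidia-smi --help-query-gpu`. As far as I understand, the volatile counter resets when the driver reloads, while the aggregate counter persists across reboots.

```python
import csv
import io
import shutil
import subprocess

# Query fields as listed by `nvidia-smi --help-query-gpu`.
FIELDS = [
    "index",
    "ecc.errors.uncorrected.volatile.total",
    "ecc.errors.uncorrected.aggregate.total",
]


def parse_ecc_csv(text):
    """Parse `--format=csv,noheader` output into {gpu_index: (volatile, aggregate)}.

    Non-numeric values such as "[N/A]" (ECC unsupported or disabled) become None.
    """
    def to_int(value):
        value = value.strip()
        return int(value) if value.isdigit() else None

    counts = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) != len(FIELDS):
            continue  # skip blank or malformed lines
        gpu_index = to_int(row[0])
        if gpu_index is None:
            continue
        counts[gpu_index] = (to_int(row[1]), to_int(row[2]))
    return counts


def read_ecc_counts():
    """Hypothetical helper: run nvidia-smi and return the parsed counters.

    Returns an empty dict if nvidia-smi is missing or exits with an error.
    """
    if shutil.which("nvidia-smi") is None:
        return {}
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS), "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return {}
    return parse_ecc_csv(result.stdout)


if __name__ == "__main__":
    for gpu, (volatile, aggregate) in sorted(read_ecc_counts().items()):
        print(f"GPU {gpu}: uncorrected volatile={volatile}, aggregate={aggregate}")
```

If the aggregate uncorrected count keeps growing on the same card across runs, that would point at that particular GPU rather than at anything in the training code.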
