RuntimeError: cuda runtime error (214) : uncorrectable ECC error encountered at /pytorch/aten/src/THC/generic/THCTensorMathPointwise.cu:207

I have 4 V100 GPUs on my machine. Running CUDA_VISIBLE_DEVICES=1 python3 train.py (or CUDA_VISIBLE_DEVICES=3) works, but CUDA_VISIBLE_DEVICES=0 python3 train.py (or CUDA_VISIBLE_DEVICES=2) raises this error:

RuntimeError: cuda runtime error (214) : uncorrectable ECC error encountered at /pytorch/aten/src/THC/generic/THCTensorMathPointwise.cu:207

It seems to work only on the 2nd and 4th devices and to fail on the 1st and 3rd.

Environment:

  • CUDA 10.1
  • PyTorch 1.3.1
  • NVIDIA DRIVER 418.116.00

How can I fix this problem?

This error points to a hardware failure.
How often do you see these errors, and are these devices warmer than the others?

Almost every time. The 1st and 3rd devices are not warmer than the others.

Could you try the 1st and 3rd devices in another workstation and run some tests to see if you still get these memory errors?
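
Before moving the cards, it might also help to look at the per-GPU ECC counters and temperatures that the driver reports. The following is only a rough sketch: it assumes nvidia-smi is on the PATH and that your driver supports the query fields listed in QUERY_FIELDS (you can check with nvidia-smi --help-query-gpu). The function names are made up for this example.

```python
import csv
import io
import subprocess

# Query fields passed to nvidia-smi; assumed to be supported by the installed
# driver (see `nvidia-smi --help-query-gpu` for the list on your machine).
QUERY_FIELDS = [
    "index",
    "temperature.gpu",
    "ecc.errors.uncorrected.volatile.total",
    "ecc.errors.uncorrected.aggregate.total",
]


def read_gpu_report():
    """Run nvidia-smi and return its CSV output as text."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_gpu_report(text):
    """Turn the CSV text into one dict per GPU; non-numeric cells become None."""
    rows = []
    for cells in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not cells:
            continue  # skip blank lines
        values = [int(c) if c.strip().isdigit() else None for c in cells]
        rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def gpus_with_uncorrectable_errors(rows):
    """Return indices of GPUs reporting a nonzero uncorrectable ECC count."""
    bad = []
    for row in rows:
        counts = [row[field] for field in QUERY_FIELDS[2:]]
        if any(c for c in counts if c is not None):
            bad.append(row["index"])
    return bad


if __name__ == "__main__":
    rows = parse_gpu_report(read_gpu_report())
    for row in rows:
        print(row)
    print("GPUs with uncorrectable ECC errors:", gpus_with_uncorrectable_errors(rows))
```

Note that the index printed by nvidia-smi follows PCI bus order, which is not guaranteed to match the numbering used by CUDA_VISIBLE_DEVICES unless CUDA_DEVICE_ORDER=PCI_BUS_ID is set, so it is worth setting that variable when comparing the two.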