I have 4 V100 GPUs on my machine. When I run
CUDA_VISIBLE_DEVICES=1 python3 train.py
(or the same command with CUDA_VISIBLE_DEVICES=3), it works. But
CUDA_VISIBLE_DEVICES=0 python3 train.py
(or the same command with CUDA_VISIBLE_DEVICES=2) raises an error:
RuntimeError: cuda runtime error (214) : uncorrectable ECC error encountered at /pytorch/aten/src/THC/generic/THCTensorMathPointwise.cu:207
It seems to work only on the 2nd and 4th devices and to fail on the 1st and 3rd.
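
To check the per-device behavior in one go, something like the following could be used. It is only a sketch: the helper name run_on_each_device is hypothetical, and it assumes train.py is in the current directory and exits with a non-zero status when it hits the error.

```python
import os
import subprocess
import sys


def run_on_each_device(cmd, devices=("0", "1", "2", "3")):
    """Run cmd once per device ID with CUDA_VISIBLE_DEVICES set to that ID.

    Returns a dict mapping device ID to the process exit code (0 means success).
    """
    results = {}
    for dev in devices:
        # Copy the current environment and expose only one GPU to the child process.
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=dev)
        proc = subprocess.run(cmd, env=env)
        results[dev] = proc.returncode
    return results


if __name__ == "__main__":
    # Assumption: train.py is the training script mentioned in this post.
    codes = run_on_each_device([sys.executable, "train.py"])
    for dev, code in codes.items():
        status = "ok" if code == 0 else "failed"
        print(f"CUDA_VISIBLE_DEVICES={dev}: {status} (exit code {code})")
```

Given what I described above, I would expect devices 0 and 2 to be reported as failed and devices 1 and 3 as ok.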
Environment:
- CUDA 10.1
- PyTorch 1.3.1
- NVIDIA driver 418.116.00
How can I fix this problem?