My second GPU, a "Tesla V100-PCIE-32GB", disappears after I run transformers code, or sometimes just after the machine has been working for a while.

The error is:

[ 1361.908162] NVRM: GPU at PCI:0000:d8:00: GPU-d1a5f877-65cb-a62e-4192-ae05bb68fc48
[ 1361.908175] NVRM: GPU Board Serial Number: 1560121001476
[ 1361.908178] NVRM: Xid (PCI:0000:d8:00): 79, pid=0, GPU has fallen off the bus.
[ 1361.908186] NVRM: GPU 0000:d8:00.0: GPU has fallen off the bus.
[ 1361.908191] NVRM: GPU 0000:d8:00.0: GPU serial number is 1560121001476.
[ 1361.908210] NVRM: A GPU crash dump has been created. If possible, please run
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.

Based on this table, Xid 79 could be caused by any of the following:

  • HW error
  • Driver issue
  • System Memory Corruption
  • Bus Error
  • Thermal Issue
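
To help narrow down the thermal and power-related candidates, it can be useful to log temperature and power draw per GPU while the workload runs. Below is a minimal sketch, assuming `nvidia-smi` is on the PATH and the driver can still talk to the card; once the GPU has fallen off the bus, the query will likely fail or omit it. The helper names `parse_gpu_csv` and `query_gpus` are made up for this example.

```python
import csv
import io
import subprocess

# Query fields understood by `nvidia-smi --query-gpu`.
FIELDS = ["index", "name", "pci.bus_id", "temperature.gpu", "power.draw"]


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader` output into a list of dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(record) != len(FIELDS):
            continue  # skip blank or malformed lines
        rows.append(dict(zip(FIELDS, (value.strip() for value in record))))
    return rows


def query_gpus():
    """Run nvidia-smi once and return one dict per GPU it can still see."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader",
    ]
    # check=False: a non-zero exit (e.g. a lost GPU) should not raise here.
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu)
```

Running this in a loop while the job is active shows whether the card at `d8:00.0` is running hot or drops out of the list right before the Xid appears.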

A while ago, another user was seeing the same issue and found that the power cable wasn't properly plugged into the GPU, which caused the same Xid, so you might want to start by checking that.
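
After reseating the cable, you can check whether the Xid comes back by scanning the kernel log. Here is a rough sketch that pulls the PCI address and Xid number out of lines shaped like the `NVRM: Xid (...)` line in the log above; the regular expression is an assumption based only on that one line, and `find_xid_events` is a made-up helper name.

```python
import re
import sys

# Matches lines such as:
#   NVRM: Xid (PCI:0000:d8:00): 79, pid=0, GPU has fallen off the bus.
XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),")


def find_xid_events(log_text):
    """Return a list of (pci_address, xid_number) tuples found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append((match.group("pci"), int(match.group("xid"))))
    return events


if __name__ == "__main__":
    # Reads kernel log text from standard input.
    for pci, xid in find_xid_events(sys.stdin.read()):
        print(f"{pci}: Xid {xid}")
```

Saved under a name of your choice (say `find_xid.py`), it can be fed from the kernel log with `dmesg | python find_xid.py`; reading `dmesg` may need root on some systems.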
