This is an extended discussion of https://github.com/pytorch/pytorch/issues/4031.
In that discussion, it was possible to find every process that had /dev/nvidiaX open but wasn't actually doing anything by manually checking the process IDs. However, when there are multiple GPUs, every process opens all of the /dev/nvidia* devices (e.g. 0-7), even when CUDA_VISIBLE_DEVICES=x is defined. The processes do use the correct GPU, but if this weren't occurring (i.e. if each process didn't open every possible device file), the solution to that issue would simply be fuser -k /dev/nvidiaX (assuming that's the only job). Technically it's possible to kill individual processes by careful checking (I haven't tried yet), but why does this happen at all? Is this a bug?
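As a rough illustration of the "careful checking" mentioned above, here is a minimal Python sketch that walks /proc, finds every process holding a given device node open, and reads the CUDA_VISIBLE_DEVICES value each process was launched with. The helper names `pids_holding` and `launch_visible_devices` are made up for this post. It assumes Linux, and /proc/<pid>/environ only shows the environment at exec time, so a value set later from inside Python (via os.environ) would not show up here.

```python
import os
from pathlib import Path


def pids_holding(device, proc_root="/proc"):
    """Return {pid: number of open descriptors that point at `device`}."""
    holders = {}
    try:
        entries = list(Path(proc_root).iterdir())
    except OSError:
        return holders  # no procfs here (non-Linux) or not readable
    for entry in entries:
        if not entry.name.isdigit():
            continue  # skip non-process entries such as /proc/self
        try:
            fds = list((entry / "fd").iterdir())
        except OSError:
            continue  # process exited, or we lack permission
        count = 0
        for fd in fds:
            try:
                if os.readlink(fd) == device:
                    count += 1
            except OSError:
                pass  # descriptor was closed while we were looking
        if count:
            holders[int(entry.name)] = count
    return holders


def launch_visible_devices(pid, proc_root="/proc"):
    """Return CUDA_VISIBLE_DEVICES from the process's launch environment, or None."""
    try:
        raw = (Path(proc_root) / str(pid) / "environ").read_bytes()
    except OSError:
        return None
    for item in raw.split(b"\0"):
        if item.startswith(b"CUDA_VISIBLE_DEVICES="):
            return item.split(b"=", 1)[1].decode()
    return None


if __name__ == "__main__":
    # Hypothetical usage: list holders of GPU 0 and what each was launched with.
    for pid, n_fds in sorted(pids_holding("/dev/nvidia0").items()):
        print(pid, n_fds, launch_visible_devices(pid))
```

With that output, only the PIDs whose launch-time CUDA_VISIBLE_DEVICES matches the stuck GPU would be candidates for kill, rather than everything fuser reports. Reading other users' /proc entries generally needs root.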
Here's the dump of lsof /dev/nvidia0. The process was created with CUDA_VISIBLE_DEVICES defined (only 1 GPU).
```
python3 974130 root 25u CHR 195,3 0t0 1217 /dev/nvidia3
python3 974130 root 26u CHR 195,0 0t0 1214 /dev/nvidia0
python3 974130 root 27u CHR 195,6 0t0 1220 /dev/nvidia6
python3 974130 root 28u CHR 195,1 0t0 1215 /dev/nvidia1
python3 974130 root 29u CHR 195,6 0t0 1220 /dev/nvidia6
python3 974130 root 30u CHR 195,7 0t0 1221 /dev/nvidia7
python3 974130 root 31u CHR 195,4 0t0 1218 /dev/nvidia4
python3 974130 root 32u CHR 195,5 0t0 1219 /dev/nvidia5
python3 974130 root 33u CHR 195,2 0t0 1216 /dev/nvidia2
python3 974130 root 34u CHR 195,2 0t0 1216 /dev/nvidia2
python3 974130 root 35u CHR 195,3 0t0 1217 /dev/nvidia3
python3 974130 root 36u CHR 195,3 0t0 1217 /dev/nvidia3
python3 974130 root 37u CHR 195,0 0t0 1214 /dev/nvidia0
python3 974130 root 38u CHR 195,0 0t0 1214 /dev/nvidia0
python3 974130 root 39u CHR 195,1 0t0 1215 /dev/nvidia1
python3 974130 root 40u CHR 195,1 0t0 1215 /dev/nvidia1
python3 974130 root 41u CHR 195,6 0t0 1220 /dev/nvidia6
python3 974130 root 42u CHR 195,6 0t0 1220 /dev/nvidia6
python3 974130 root 43u CHR 195,6 0t0 1220 /dev/nvidia6
python3 974130 root 44u CHR 195,6 0t0 1220 /dev/nvidia6
python3 974130 root 45u CHR 195,7 0t0 1221 /dev/nvidia7
python3 974130 root 46u CHR 195,7 0t0 1221 /dev/nvidia7
python3 974130 root 47u CHR 195,4 0t0 1218 /dev/nvidia4
python3 974130 root 48u CHR 195,4 0t0 1218 /dev/nvidia4
python3 974130 root 49u CHR 195,5 0t0 1219 /dev/nvidia5
python3 974130 root 50u CHR 195,5 0t0 1219 /dev/nvidia5
python3 974130 root 52u CHR 195,6 0t0 1220 /dev/nvidia6
python3 974130 root 53u CHR 195,6 0t0 1220 /dev/nvidia6
python3 974130 root 54u CHR 195,6 0t0 1220 /dev/nvidia6
python3 974130 root 55u CHR 195,6 0t0 1220 /dev/nvidia6
python3 974130 root 57u CHR 195,6 0t0 1220 /dev/nvidia6
python3 974130 root 58u CHR 195,6 0t0 1220 /dev/nvidia6
python3 974130 root 59u CHR 195,6 0t0 1220 /dev/nvidia6
python3 974130 root 60u CHR 195,6 0t0 1220 /dev/nvidia6```
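A dump like the one above is easier to read once it's collapsed into per-device counts. Here is a small Python sketch that does that for pasted lsof text. The function name `count_device_fds` is made up for this post, and it assumes the default lsof column layout where PID is the second field and the device path is the last one. SAMPLE is just four lines copied from the dump.

```python
from collections import Counter

# Four lines copied from the dump above, used only as sample input.
SAMPLE = """\
python3 974130 root 25u CHR 195,3 0t0 1217 /dev/nvidia3
python3 974130 root 26u CHR 195,0 0t0 1214 /dev/nvidia0
python3 974130 root 37u CHR 195,0 0t0 1214 /dev/nvidia0
python3 974130 root 38u CHR 195,0 0t0 1214 /dev/nvidia0
"""


def count_device_fds(lsof_text):
    """Count open descriptors per (pid, device path) in lsof output."""
    counts = Counter()
    for line in lsof_text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and anything malformed.
        if len(fields) < 9 or not fields[1].isdigit():
            continue
        pid, name = int(fields[1]), fields[-1]
        if name.startswith("/dev/nvidia"):
            counts[(pid, name)] += 1
    return counts


if __name__ == "__main__":
    for (pid, device), n in sorted(count_device_fds(SAMPLE).items()):
        print(pid, device, n)
```

Feeding it the full lsof output instead of SAMPLE shows at a glance which devices a single PID has open and how many times.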