Processes open all /dev/nvidia* even with CUDA_VISIBLE_DEVICES defined

This is an extended discussion of https://github.com/pytorch/pytorch/issues/4031.
In that discussion, the suggestion is to find every process that has /dev/nvidiaX open but isn’t actually doing anything, by manually checking process IDs. However, when there are multiple GPUs, every process opens all of the /dev/nvidia* devices (e.g. 0-7), even when CUDA_VISIBLE_DEVICES=x is defined. The processes do use the correct GPU, but if they did not open every device file, the solution to that issue would simply be fuser -k /dev/nvidiaX (assuming that’s the only job on that GPU). Technically it’s possible to kill individual processes by careful checking (I haven’t tried yet), but why does this happen at all? Is this a bug?
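The "careful checking" can be sketched in Python under some assumptions: a Linux /proc filesystem, and enough privileges to read other processes' fd and environ entries. `find_gpu_holders` and `visible_devices` are hypothetical helper names, not part of PyTorch or the NVIDIA tools. Because every process holds all device files open, the open descriptors alone do not identify the GPU in use, so the second helper reads the CUDA_VISIBLE_DEVICES value the process was started with. It will not see a value set later from inside the process.

```python
import os
import re
from collections import defaultdict

# Matches numbered GPU device files only (not /dev/nvidiactl or /dev/nvidia-uvm).
NVIDIA_DEV = re.compile(r"^/dev/nvidia(\d+)$")


def find_gpu_holders(proc_root="/proc"):
    """Map each PID to the set of GPU indices whose /dev/nvidiaN it has open."""
    holders = defaultdict(set)
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self"
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor closed while we were scanning
            match = NVIDIA_DEV.match(target)
            if match:
                holders[int(entry)].add(int(match.group(1)))
    return dict(holders)


def visible_devices(pid, proc_root="/proc"):
    """Return the CUDA_VISIBLE_DEVICES value from the process's initial
    environment, or None if it is unset or unreadable."""
    try:
        with open(os.path.join(proc_root, str(pid), "environ"), "rb") as f:
            raw = f.read()
    except OSError:
        return None
    # Entries in environ are NUL-separated KEY=VALUE pairs.
    for item in raw.split(b"\0"):
        key, sep, value = item.partition(b"=")
        if key == b"CUDA_VISIBLE_DEVICES" and sep:
            return value.decode(errors="replace")
    return None
```

Calling `find_gpu_holders()` and then `visible_devices(pid)` for each returned PID gives a list of candidates to inspect before killing anything by hand.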

Here’s the dump of lsof /dev/nvidia0. The process was created with CUDA_VISIBLE_DEVICES defined (only 1 GPU).

```
python3 974130 root   25u   CHR   195,3      0t0 1217 /dev/nvidia3
python3 974130 root   26u   CHR   195,0      0t0 1214 /dev/nvidia0
python3 974130 root   27u   CHR   195,6      0t0 1220 /dev/nvidia6
python3 974130 root   28u   CHR   195,1      0t0 1215 /dev/nvidia1
python3 974130 root   29u   CHR   195,6      0t0 1220 /dev/nvidia6
python3 974130 root   30u   CHR   195,7      0t0 1221 /dev/nvidia7
python3 974130 root   31u   CHR   195,4      0t0 1218 /dev/nvidia4
python3 974130 root   32u   CHR   195,5      0t0 1219 /dev/nvidia5
python3 974130 root   33u   CHR   195,2      0t0 1216 /dev/nvidia2
python3 974130 root   34u   CHR   195,2      0t0 1216 /dev/nvidia2
python3 974130 root   35u   CHR   195,3      0t0 1217 /dev/nvidia3
python3 974130 root   36u   CHR   195,3      0t0 1217 /dev/nvidia3
python3 974130 root   37u   CHR   195,0      0t0 1214 /dev/nvidia0
python3 974130 root   38u   CHR   195,0      0t0 1214 /dev/nvidia0
python3 974130 root   39u   CHR   195,1      0t0 1215 /dev/nvidia1
python3 974130 root   40u   CHR   195,1      0t0 1215 /dev/nvidia1
python3 974130 root   41u   CHR   195,6      0t0 1220 /dev/nvidia6
python3 974130 root   42u   CHR   195,6      0t0 1220 /dev/nvidia6
python3 974130 root   43u   CHR   195,6      0t0 1220 /dev/nvidia6
python3 974130 root   44u   CHR   195,6      0t0 1220 /dev/nvidia6
python3 974130 root   45u   CHR   195,7      0t0 1221 /dev/nvidia7
python3 974130 root   46u   CHR   195,7      0t0 1221 /dev/nvidia7
python3 974130 root   47u   CHR   195,4      0t0 1218 /dev/nvidia4
python3 974130 root   48u   CHR   195,4      0t0 1218 /dev/nvidia4
python3 974130 root   49u   CHR   195,5      0t0 1219 /dev/nvidia5
python3 974130 root   50u   CHR   195,5      0t0 1219 /dev/nvidia5
python3 974130 root   52u   CHR   195,6      0t0 1220 /dev/nvidia6
python3 974130 root   53u   CHR   195,6      0t0 1220 /dev/nvidia6
python3 974130 root   54u   CHR   195,6      0t0 1220 /dev/nvidia6
python3 974130 root   55u   CHR   195,6      0t0 1220 /dev/nvidia6
python3 974130 root   57u   CHR   195,6      0t0 1220 /dev/nvidia6
python3 974130 root   58u   CHR   195,6      0t0 1220 /dev/nvidia6
python3 974130 root   59u   CHR   195,6      0t0 1220 /dev/nvidia6
python3 974130 root   60u   CHR   195,6      0t0 1220 /dev/nvidia6
```
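A listing like this is easier to compare across processes once it is aggregated. The sketch below counts open descriptors per (PID, device) pair, assuming the default lsof column layout with the PID in the second column and the file name in the last. `count_fds_per_device` is a hypothetical helper name used only for illustration.

```python
from collections import Counter


def count_fds_per_device(lsof_text):
    """Count open descriptors per (pid, device path) in lsof output."""
    counts = Counter()
    for line in lsof_text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and non-NVIDIA files.
        if fields and fields[-1].startswith("/dev/nvidia"):
            counts[(int(fields[1]), fields[-1])] += 1
    return counts
```

Feeding it the captured output of lsof shows at a glance whether a process holds every device open or only the one named in CUDA_VISIBLE_DEVICES.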

I would recommend asking this question on the NVIDIA discussion board, as it seems to be more driver-related.
