I'm working on a 10-GPU cluster and I'm having some trouble with CUDA_VISIBLE_DEVICES.
Reading the forum, I've seen it's the recommended way of choosing arbitrary GPUs.
I’m running the following shell script:
export CUDA_VISIBLE_DEVICES=1,2,3
export PATH=/usr/local/cuda-9.0-cudnn--v7.0/lib64/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda-9.0-cudnn--v7.0/lib64"
export CUDA_HOME=/usr/local/cuda-9.0-cudnn--v7.0
python
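For reference, I understand the same masking can also be done from inside Python, as long as it happens before CUDA is initialized (a minimal sketch, assuming nothing has touched CUDA yet in the process):

import os
# Must be set before torch initializes CUDA; changing it afterwards has no effect
os.environ['CUDA_VISIBLE_DEVICES'] = '1,2,3'

import torch
print(torch.cuda.device_count())  # expect 3 if the mask took effect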
Here you can see the nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26 Driver Version: 396.26 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN X (Pascal) Off | 00000000:04:00.0 Off | N/A |
| 47% 79C P2 190W / 250W | 6629MiB / 12196MiB | 92% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN X (Pascal) Off | 00000000:05:00.0 Off | N/A |
| 23% 26C P8 8W / 250W | 0MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 TITAN X (Pascal) Off | 00000000:06:00.0 Off | N/A |
| 41% 62C P2 62W / 250W | 11599MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN Xp Off | 00000000:07:00.0 Off | N/A |
| 41% 66C P2 185W / 250W | 11177MiB / 12196MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 4 TITAN X (Pascal) Off | 00000000:08:00.0 Off | N/A |
| 58% 86C P2 199W / 250W | 11761MiB / 12196MiB | 78% Default |
+-------------------------------+----------------------+----------------------+
| 5 TITAN X (Pascal) Off | 00000000:0B:00.0 Off | N/A |
| 51% 83C P2 213W / 250W | 11761MiB / 12196MiB | 90% Default |
+-------------------------------+----------------------+----------------------+
| 6 GeForce GTX 108... Off | 00000000:0C:00.0 Off | N/A |
| 23% 29C P8 8W / 250W | 0MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 TITAN X (Pascal) Off | 00000000:0D:00.0 Off | N/A |
| 54% 83C P2 143W / 250W | 11759MiB / 12196MiB | 93% Default |
+-------------------------------+----------------------+----------------------+
| 8 GeForce GTX 108... Off | 00000000:0E:00.0 Off | N/A |
| 23% 33C P8 11W / 250W | 0MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 9 TITAN X (Pascal) Off | 00000000:0F:00.0 Off | N/A |
| 45% 74C P2 103W / 250W | 11761MiB / 12196MiB | 55% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 52925 C python 6619MiB |
| 2 53553 C python3 11589MiB |
| 3 52925 C python 11167MiB |
| 4 53160 C python 11749MiB |
| 5 53161 C python 11749MiB |
| 7 56769 C python 11747MiB |
| 9 20983 C python 11749MiB |
+-----------------------------------------------------------------------------+
Now I try to store a simple variable on the GPU:
import torch
import os
os.environ.get('CUDA_VISIBLE_DEVICES')
'1'
a = torch.rand(100)
a.cuda()
tensor([ 0.0554, 0.0375, 0.4708, 0.6522, 0.4640, 0.4087, 0.4738,
0.3571, 0.4100, 0.4238, 0.6673, 0.7015, 0.8013, 0.8452,
0.6704, 0.4123, 0.1702, 0.3805, 0.1789, 0.5453, 0.6197,
0.5231, 0.7428, 0.7978, 0.3173, 0.0653, 0.4624, 0.4298,
0.2032, 0.5640, 0.1568, 0.2366, 0.0436, 0.3464, 0.8633,
0.8253, 0.7330, 0.2782, 0.6662, 0.3576, 0.1209, 0.7470,
0.4402, 0.8037, 0.2154, 0.8686, 0.3976, 0.0305, 0.9457,
0.6998, 0.5220, 0.4419, 0.9357, 0.5723, 0.4109, 0.7055,
0.3444, 0.3484, 0.7930, 0.5491, 0.1293, 0.4718, 0.9671,
0.8292, 0.0422, 0.1354, 0.3751, 0.1575, 0.8005, 0.7624,
0.7628, 0.2370, 0.8926, 0.2794, 0.5764, 0.7508, 0.5215,
0.2245, 0.8482, 0.0440, 0.2812, 0.0715, 0.1664, 0.1170,
0.9271, 0.8802, 0.2525, 0.1377, 0.5035, 0.1035, 0.5497,
0.8906, 0.1272, 0.2019, 0.3545, 0.3818, 0.8902, 0.9140,
0.5344, 0.6614], device='cuda:0')
It's stored on cuda:0 instead of cuda:1.
According to nvidia-smi, the new process shows up on GPU 0:
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 52646 C python 509MiB |
| 0 52925 C python 6619MiB |
| 2 53553 C python3 11751MiB |
| 3 52925 C python 11167MiB |
| 4 53160 C python 11749MiB |
| 5 53161 C python 11749MiB |
| 7 56769 C python 11747MiB |
| 9 20983 C python 11749MiB |
+-----------------------------------------------------------------------------+
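To narrow this down, here's what I'd check from inside the same interpreter (a minimal sketch; the expected values assume CUDA_VISIBLE_DEVICES really reached the process):

import torch
print(torch.cuda.device_count())    # how many GPUs PyTorch actually sees
print(torch.cuda.current_device())  # 0 by default; visible devices are renumbered from 0
a = torch.rand(100).cuda(1)         # explicitly target the second *visible* GPU
print(a.device)                     # should print cuda:1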
Trying with GPUs 1 and 2 (as numbered by nvidia-smi):
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 7467 C python 509MiB |
| 0 52925 C python 6619MiB |
| 1 7467 C python 509MiB |
| 2 7388 C python3 11751MiB |
| 3 52925 C python 11167MiB |
| 4 53160 C python 11749MiB |
| 5 53161 C python 11749MiB |
| 7 56769 C python 11747MiB |
| 9 20983 C python 11749MiB |
+-----------------------------------------------------------------------------+
It allocates memory on both GPU 0 and GPU 1.
Why?
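Since the machine mixes TITAN X (Pascal), TITAN Xp, and GTX 1080 Ti cards, I guess one way to cross-check the mapping is to print the device names PyTorch sees (a small sketch using torch.cuda.get_device_name):

import torch
# With CUDA_VISIBLE_DEVICES=1,2,3 this should print two
# "TITAN X (Pascal)" entries followed by "TITAN Xp"
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))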
One more simple question: are device_ids in PyTorch relative to the visible GPUs? I mean, if you set CUDA_VISIBLE_DEVICES=4,5,8, then GPU 4 would be cuda:0 for PyTorch, and so on, right?
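To make that concrete, this is the behaviour I'm assuming (a sketch, not verified on my machine):

# Assuming the interpreter was launched with CUDA_VISIBLE_DEVICES=4,5,8:
import torch
print(torch.cuda.device_count())  # 3: only the masked devices are visible
x = torch.rand(100).cuda(0)       # should land on physical GPU 4 (first entry in the mask)
y = torch.rand(100).cuda(2)       # should land on physical GPU 8 (third entry in the mask)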