Invalid device id on available gpu!

Code:

os.environ["CUDA_VISIBLE_DEVICES"] = ",".join([str(gpu) for gpu in args.gpus])
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print("args.gpus: ", args.gpus)
print("Available/CUDA_VISIBLE_DEVICES", os.environ["CUDA_VISIBLE_DEVICES"])
print("Device count", torch.cuda.device_count())

model = torch.nn.DataParallel(bert_model, device_ids=args.gpus) #FAILS
model.to(device)

The GPUs seem to be available!
Output/Error:

args.gpus:  [1, 2, 3, 4, 5, 6]
Available/CUDA_VISIBLE_DEVICES 1,2,3,4,5,6
Device count 6
Traceback (most recent call last):
  File "train_classifier.py", line 45, in <module>
    model = torch.nn.DataParallel(bert_model, device_ids=args.gpus)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/data_parallel.py", line 133, in __init__
    _check_balance(self.device_ids)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/data_parallel.py", line 19, in _check_balance
    dev_props = [torch.cuda.get_device_properties(i) for i in device_ids]
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/data_parallel.py", line 19, in <listcomp>
    dev_props = [torch.cuda.get_device_properties(i) for i in device_ids]
  File "/usr/local/lib/python3.7/dist-packages/torch/cuda/__init__.py", line 328, in get_device_properties
    raise AssertionError("Invalid device id")
AssertionError: Invalid device id

The gpu is available!

$ nvidia-smi

| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   40C    P0    59W / 300W |  15961MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   40C    P0    43W / 300W |     12MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   39C    P0    44W / 300W |     12MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   39C    P0    43W / 300W |     12MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |
| N/A   37C    P0    43W / 300W |     12MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   41C    P0    44W / 300W |     12MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   40C    P0    44W / 300W |     12MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   41C    P0    44W / 300W |     12MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Can you try

model = torch.nn.DataParallel(bert_model, device_ids=[0,1,2,3,4,5])

I think the GPUs are renumbered once CUDA_VISIBLE_DEVICES is set.
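A minimal sketch of that idea, assuming CUDA_VISIBLE_DEVICES is set before CUDA is initialized: the visible devices are then indexed 0..N-1 inside the process, so the device ids can be derived from how many GPUs were requested. The helper name `logical_device_ids` and the hard-coded `physical_gpus` list (standing in for `args.gpus`) are made up for illustration.

```python
import os


def logical_device_ids(physical_gpus):
    """Return the in-process indices for the GPUs left visible (0..N-1)."""
    return list(range(len(physical_gpus)))


# Stand-in for args.gpus from the question (assumption for this sketch).
physical_gpus = [1, 2, 3, 4, 5, 6]

# Hide every GPU except the requested ones; do this before touching CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in physical_gpus)

device_ids = logical_device_ids(physical_gpus)
print(device_ids)  # [0, 1, 2, 3, 4, 5]

# The question's DataParallel call would then take the logical ids:
# model = torch.nn.DataParallel(bert_model, device_ids=device_ids)
```

The point is only that `device_ids` should hold the logical indices, not the numbers passed to CUDA_VISIBLE_DEVICES.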

Oh, renumbered. Isn’t the GPU number the one shown in nvidia-smi?
With that change it gets past the device check, but now I get a different error (phew :)):
RuntimeError: CUDA out of memory. Tried to allocate 384.00 MiB (GPU 0; 31.75 GiB total capacity; 14.70 GiB already allocated; 378.44 MiB free; 14.76 GiB reserved in total by PyTorch)

It seems the memory error is on GPU 0, which already has something else running on it. That’s why I didn’t start with 0.

What do you think? Which GPU does GPU 0 refer to here?

@abhigenie92 As @Erricia mentioned, the GPUs are renumbered because CUDA_VISIBLE_DEVICES is 1,2,3,4,5,6. So device 0 maps to physical GPU 1, device 1 to physical GPU 2, and so on. I verified this locally as well: starting with 0 memory usage on all GPUs and then using [0, 1, 2, 3, 4, 5] as the device ids with CUDA_VISIBLE_DEVICES=1,2,3,4,5,6, I see memory usage on GPUs 1-6:

Wed Jul 29 13:15:28 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.126.02   Driver Version: 418.126.02   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40           On   | 00000000:07:00.0 Off |                  Off |
| N/A   44C    P8    17W / 250W |      0MiB / 12215MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M40           On   | 00000000:08:00.0 Off |                  Off |
| N/A   43C    P0    67W / 250W |    447MiB / 12215MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M40           On   | 00000000:09:00.0 Off |                  Off |
| N/A   42C    P0    65W / 250W |    447MiB / 12215MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M40           On   | 00000000:0A:00.0 Off |                  Off |
| N/A   42C    P0    67W / 250W |    447MiB / 12215MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla M40           On   | 00000000:0B:00.0 Off |                  Off |
| N/A   42C    P0    68W / 250W |    447MiB / 12215MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla M40           On   | 00000000:0C:00.0 Off |                  Off |
| N/A   44C    P0    66W / 250W |    447MiB / 12215MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla M40           On   | 00000000:0D:00.0 Off |                  Off |
| N/A   41C    P0    66W / 250W |    447MiB / 12215MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla M40           On   | 00000000:0E:00.0 Off |                  Off |
| N/A   36C    P8    18W / 250W |      0MiB / 12215MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
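To answer "which GPU does GPU 0 refer to" for any setting, the mapping can be spelled out explicitly. This is a rough sketch, assuming CUDA_VISIBLE_DEVICES holds plain integer indices (it can also hold UUIDs, which this does not handle); the function name `logical_to_physical` is hypothetical. It also assumes CUDA enumerates devices in the same order as nvidia-smi, which is usually the case on a machine with identical GPUs and can be forced with CUDA_DEVICE_ORDER=PCI_BUS_ID.

```python
def logical_to_physical(visible_devices):
    """Map in-process device index -> index listed in CUDA_VISIBLE_DEVICES."""
    ids = [int(tok) for tok in visible_devices.split(",") if tok.strip()]
    return dict(enumerate(ids))


mapping = logical_to_physical("1,2,3,4,5,6")
for logical, physical in mapping.items():
    print(f"cuda:{logical} -> physical GPU {physical}")

print(mapping[0])  # 1
```

With the setting from this thread, "GPU 0" in the out-of-memory message would be physical GPU 1 under these assumptions, not the busy GPU 0 shown in nvidia-smi.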