Code:
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join([str(gpu) for gpu in args.gpus])
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("args.gpus: ", args.gpus)
print("Available/CUDA_VISIBLE_DEVICES", os.environ["CUDA_VISIBLE_DEVICES"])
print("Device count", torch.cuda.device_count())
model = torch.nn.DataParallel(bert_model, device_ids=args.gpus)  # fails with AssertionError: Invalid device id
model.to(device)
The GPUs seem to be available, judging by the printed device count.
Output/Error:
args.gpus: [1, 2, 3, 4, 5, 6]
Available/CUDA_VISIBLE_DEVICES 1,2,3,4,5,6
Device count 6
Traceback (most recent call last):
  File "train_classifier.py", line 45, in <module>
    model = torch.nn.DataParallel(bert_model, device_ids=args.gpus)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/data_parallel.py", line 133, in __init__
    _check_balance(self.device_ids)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/data_parallel.py", line 19, in _check_balance
    dev_props = [torch.cuda.get_device_properties(i) for i in device_ids]
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/data_parallel.py", line 19, in <listcomp>
    dev_props = [torch.cuda.get_device_properties(i) for i in device_ids]
  File "/usr/local/lib/python3.7/dist-packages/torch/cuda/__init__.py", line 328, in get_device_properties
    raise AssertionError("Invalid device id")
AssertionError: Invalid device id
All eight GPUs are listed by nvidia-smi:
$ nvidia-smi
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   40C    P0    59W / 300W |  15961MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   40C    P0    43W / 300W |     12MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   39C    P0    44W / 300W |     12MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   39C    P0    43W / 300W |     12MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |
| N/A   37C    P0    43W / 300W |     12MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   41C    P0    44W / 300W |     12MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   40C    P0    44W / 300W |     12MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   41C    P0    44W / 300W |     12MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
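A likely explanation, judging from the traceback and the printed device count: once CUDA_VISIBLE_DEVICES is set to 1,2,3,4,5,6, the process sees only those six GPUs, and they are renumbered from 0. The valid ids inside the process would then be 0 through 5, so passing device_ids=[1, 2, 3, 4, 5, 6] asks for id 6, which does not exist. The sketch below shows the remapping under that assumption. The helper name visible_device_ids is hypothetical (it is not part of PyTorch), and the torch lines are left as comments so the snippet runs without a GPU.

```python
import os

def visible_device_ids(physical_gpus):
    """Return the logical ids (0..n-1) that the masked GPUs get inside this process.

    Hypothetical helper for illustration; assumes CUDA_VISIBLE_DEVICES is set
    to exactly `physical_gpus` before CUDA is initialized.
    """
    return list(range(len(physical_gpus)))

physical_gpus = [1, 2, 3, 4, 5, 6]  # same values as args.gpus in the question

# Mask the GPUs first; this has to happen before the first CUDA call.
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(gpu) for gpu in physical_gpus)

# Use the renumbered ids, not the physical ones, when building DataParallel.
device_ids = visible_device_ids(physical_gpus)
print("device_ids for DataParallel:", device_ids)

# With torch installed and the GPUs present, the model would then be wrapped as:
# model = torch.nn.DataParallel(bert_model, device_ids=device_ids)
# model.to(torch.device("cuda:0"))
```

An alternative under the same assumption is to leave CUDA_VISIBLE_DEVICES unset and pass the physical ids in args.gpus directly as device_ids; in that case the module's parameters would need to live on the first listed device (cuda:1 here) rather than on the default "cuda" device.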