CUDA_VISIBLE_DEVICES=0,1 python xxx.py does not work

I have 2 GPUs.

When I train on a single GPU, both of the following commands work:
CUDA_VISIBLE_DEVICES=0 python xxx.py
CUDA_VISIBLE_DEVICES=1 python xxx.py

However, when I try to train on both GPUs with the following command,
CUDA_VISIBLE_DEVICES=0,1 python xxx.py
it no longer works. Only the default GPU:0 is used for training, and when GPU:0 runs out of memory the training terminates with an ‘out of memory’ error. GPU:1 sits idle and is never used. Why?

The GPU information is shown below:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00006B71:00:00.0 Off |                    0 |
| N/A   54C    P0    82W / 149W |   8772MiB / 11441MiB |     40%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 000096F1:00:00.0 Off |                    0 |
| N/A   25C    P8    32W / 149W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2495      C   python                                      8759MiB |
+-----------------------------------------------------------------------------+

Could someone explain this situation? What should I do to solve it?

Thanks in advance; I really appreciate any feedback.

That is a CUDA environment variable, not a PyTorch one.
You can put it in front of any shell command; it controls which devices the launched process can see.
So with 0 or 1 your Python process can see only one of the GPUs, and with 0,1 it can see both. But that doesn’t mean PyTorch is going to train on both automatically.
To train on several GPUs you should use a module like https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html
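
To make the point about the environment variable concrete, here is a minimal sketch that uses only the Python standard library. The helper name devices_seen_by_child is made up for illustration, and the child process only echoes the variable back; it does not touch CUDA or PyTorch at all.

```python
import os
import subprocess
import sys


def devices_seen_by_child(value):
    """Launch a child Python process with CUDA_VISIBLE_DEVICES=value and
    return the device IDs that the child reads from its own environment."""
    # Copy the current environment and override just this one variable,
    # which is what `CUDA_VISIBLE_DEVICES=... python xxx.py` does in a shell.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=value)
    child_code = "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip().split(",")


if __name__ == "__main__":
    for value in ("0", "1", "0,1"):
        print(value, "->", devices_seen_by_child(value))
```

In a real training script, torch.cuda.device_count() would report how many of those devices PyTorch can see, but the script still has to spread the work across them itself, for example with DataParallel.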

Thanks for the information. Actually, I have already tried

model.cuda()
model = torch.nn.DataParallel(model, device_ids=[0, 1])

but I always get the following error:

  File "/home/speech/treelstm_nlg/HRED/ContextLSTMLayer.py", line 21, in forward
    output, (hn, cn) = self.rnn(x, (h_0, c_0))
  File "/home/speech/treelstm_nlg/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/speech/treelstm_nlg/venv/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 579, in forward
    self.check_forward_args(input, hx, batch_sizes)
  File "/home/speech/treelstm_nlg/venv/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 534, in check_forward_args
    'Expected hidden[0] size {}, got {}')
  File "/home/speech/treelstm_nlg/venv/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 196, in check_hidden_size
    raise RuntimeError(msg.format(expected_hidden_size, list(hx.size())))
RuntimeError: Expected hidden[0] size (1, 64, 300), got [1, 128, 300]

Why does this happen?

By the way, what is the difference between DataParallel and DistributedDataParallel? It seems the official documentation recommends DistributedDataParallel. In my case, some of the computation code is written by me, so I am not sure which is the better choice.

Thanks in advance! I really appreciate any feedback.

Hmm. Basically, it puts a copy of the model on each GPU. Then, when you call forward, it takes the batch, splits it into as many chunks as there are GPUs, and sends each chunk to the corresponding GPU.

It assumes the batch dimension is dimension 0. I think this issue can happen because an RNN by default expects the temporal dimension first. So how is the input to your network laid out?
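
The splitting described above can be sketched in plain Python. This only emulates the arithmetic (pieces of ceil(batch / replicas) items along dimension 0, which is how torch.chunk sizes its pieces); it is not PyTorch's own code. The helper names, the sequence length of 20 and the input feature size of 300 are made up for illustration. It also shows one possible cause of the error, under the assumption that h_0 is built for the full batch of 128 rather than for the chunk each replica receives. The thread doesn't show that code, so treat this as a guess.

```python
import math


def chunk_sizes(batch_size, num_replicas):
    """Sizes of the pieces when a batch is split along dimension 0:
    each piece holds ceil(batch_size / num_replicas) items, except
    possibly the last one, which holds whatever is left."""
    step = math.ceil(batch_size / num_replicas)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        sizes.append(min(step, remaining))
        remaining -= step
    return sizes


def expected_hidden_shape(input_shape, num_layers, hidden_size, batch_first=True):
    """Shape a unidirectional LSTM expects for h_0 and c_0:
    (num_layers, batch, hidden_size), whatever batch_first is."""
    batch = input_shape[0] if batch_first else input_shape[1]
    return (num_layers, batch, hidden_size)


if __name__ == "__main__":
    full_input = (128, 20, 300)  # (batch, seq_len, features); last two are assumed
    fixed_h0 = (1, 128, 300)     # assumption: hidden state sized for the full batch
    for n in chunk_sizes(full_input[0], 2):
        replica_input = (n,) + full_input[1:]
        wanted = expected_hidden_shape(replica_input, num_layers=1, hidden_size=300)
        print(replica_input, "expects h_0", wanted, "| fixed h_0 fits:", wanted == fixed_h0)
```

If that assumption holds, building h_0 and c_0 inside forward from x.size(0) would keep them in step with the chunk each replica gets. Note that nn.LSTM expects h_0 and c_0 as (num_layers, batch, hidden_size) even when batch_first=True.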

As for DistributedDataParallel, I don’t really know why it is recommended now; that wasn’t the case a few versions ago. It seems the performance of DataParallel is worse (https://pytorch.org/docs/stable/notes/cuda.html#cuda-nn-ddp-instead).

Anyway, both will probably work in a similar way.

In my case, batch_first=True. Isn’t that right?

Hmm, then I don’t really know.
I mean, the module does nothing but split the batch into two chunks.
Are you sure the error isn’t raised without it?

Can you try to post a standalone script?