While using nn.DataParallel, only one GPU is accessed

I have 2 GPUs in my system. I am using the DataParallel module over my model, and I have made both GPUs visible with os.environ["CUDA_VISIBLE_DEVICES"] = "0,1". But while running, only GPU 0 is used if it is listed first in CUDA_VISIBLE_DEVICES, or only GPU 1 if it is listed first. What could be the reason why both GPUs are not used together?
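A minimal, self-contained sketch of the setup (with a small nn.Linear standing in for my actual model) looks like this:

import os
# make both GPUs visible; this must happen before torch initializes CUDA
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import torch as t
import torch.nn as nn

model = nn.Linear(10, 10)                        # stand-in for my actual model
device_ids = list(range(t.cuda.device_count()))  # expecting [0, 1]
model = nn.DataParallel(model, device_ids=device_ids)
model = model.cuda()

x = t.randn(8, 10).cuda()  # a batch of 8 that should be split across both GPUs
out = model(x)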

Then I tried manually creating replicas, but I am getting this error:
RuntimeError: torch/csrc/autograd/variable.cpp:115: get_grad_fn: Assertion output_nr == 0 failed.
It happens in the LSTM part of my code:

replicas = t.nn.parallel.replicate(model, device_ids)
inputs = t.nn.parallel.scatter((dropout, encoder_word_input, encoder_character_input,
                                decoder_word_input, decoder_character_input, None),
                               device_ids)
replicas = replicas[:len(inputs)]
outputs = t.nn.parallel.parallel_apply(replicas, inputs)
out = t.nn.parallel.gather(outputs, output_device)

Did you specify device_ids?

Yes, I have specified it, but it is still not working.
In the forward function of my model I take multiple inputs: one is a floating-point number, four are Variables of shape minibatch x feature_size, and the last is a latent vector.
Could that be the reason only 1 GPU is being used?
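For reference, a hypothetical stand-in with the same kind of forward signature (the argument names mirror my real model; the body here is just a toy LSTM, not my actual RVAE):

import torch as t
import torch.nn as nn

class RVAESketch(nn.Module):
    # hypothetical stand-in, not my real RVAE
    def __init__(self, feature_size=8):
        super().__init__()
        self.rnn = nn.LSTM(feature_size, feature_size, batch_first=True)

    def forward(self, drop_prob, encoder_word_input, encoder_character_input,
                decoder_word_input, decoder_character_input, z=None):
        # drop_prob is a plain float, the four *_input arguments are
        # minibatch x feature_size tensors, z is an optional latent vector
        out, _ = self.rnn(encoder_word_input.unsqueeze(1))
        return out.squeeze(1)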

Hmm… Can you show us the lines where you use DataParallel in your code?

rvae = RVAE(parameters)
device_ids = [i for i in range(t.cuda.device_count())]
rvae = t.nn.DataParallel(rvae, device_ids)
rvae = rvae.cuda()

and the forward call is the following:

rvae(0., encoder_word_input, encoder_character_input,
     decoder_word_input, decoder_character_input,
     z=None)

Just remove rvae = rvae.cuda(). I think that is the problematic part.

That gave me an error: Expected object of type torch.LongTensor but found type torch.cuda.LongTensor for argument #3 'index'

How do you know only one GPU is working? Is the other one completely empty?
Could you add a print statement in your forward method, showing the current device of the tensor?
You can find a small example here. You would have to change it to print the device instead of the shape.
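Something along these lines (a minimal sketch with a toy model; in your case the print would go inside your RVAE's forward):

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 10)

    def forward(self, x):
        # with two working replicas you should see both cuda:0 and cuda:1 printed
        print("forward running on", x.device)
        return self.fc(x)

model = nn.DataParallel(Net()).cuda()
out = model(torch.randn(8, 10).cuda())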

I am using watch nvidia-smi to check the memory usage. Only one GPU shows any usage; the other stays idle.

I checked what you suggested by calling get_device() inside the forward function. Only one GPU is being used.

I also tried manually creating replicas:

replicas = t.nn.parallel.replicate(rvae, device_ids)  # rvae is my model
inputs = t.nn.parallel.scatter((dropout, encoder_word_input, encoder_character_input,
                                decoder_word_input, decoder_character_input, None),
                               device_ids)
replicas = replicas[:len(inputs)]
outputs = t.nn.parallel.parallel_apply(replicas, inputs)
out = t.nn.parallel.gather(outputs, output_device)

But I am getting this error:
RuntimeError: torch/csrc/autograd/variable.cpp:115: get_grad_fn: Assertion output_nr == 0 failed.
It happens in the LSTM part of my code.

Sorry, you're right. You do need to send the DataParallel model to the GPU.

What is your batch size? You need a batch_size > 1 to use both GPUs.
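As a rough sketch (not your model): DataParallel splits the input tensor along dim 0, so with two GPUs a batch of 8 gives each replica 4 samples, while a batch of 1 cannot be split at all.

import torch
import torch.nn as nn

model = nn.DataParallel(nn.Linear(10, 10)).cuda()

x = torch.randn(8, 10).cuda()   # batch of 8 -> about 4 samples per GPU
out = model(x)

x1 = torch.randn(1, 10).cuda()  # batch of 1 -> nothing to split, only one GPU runs
out1 = model(x1)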

I am facing the same problem and my batch_size is > 1. I am using two NVIDIA P100 GPUs on Google Kubernetes Engine.
@Tony_Gracious, may I know if you have solved this problem?

No, I didn't solve it.

@Tony_Gracious in my case, it was because I had initially trained the model using nn.DataParallel with one GPU; when I later reloaded the model, DataParallel still stored the previous device_ids, hence only the single GPU was used. I now solve it by re-wrapping my model with nn.DataParallel every time I load it:

model = _load_model()
model = nn.DataParallel(model.module)

You also need to be careful about which dimension your data is batched on. I had to change model = nn.DataParallel(model.module) into model = nn.DataParallel(model.module, dim=1), since I am using batch_first=False.
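Putting it together, a minimal sketch of that reload path (_load_model is just a placeholder for however you restore your checkpoint; the dim=1 variant only applies when the batch dimension is dim 1, e.g. inputs of shape seq_len x batch x features with batch_first=False):

import torch.nn as nn

model = _load_model()                  # placeholder: returns the previously wrapped DataParallel model
model = nn.DataParallel(model.module)  # re-wrap so the current device_ids are picked up
# model = nn.DataParallel(model.module, dim=1)  # use this if your batch dimension is dim 1
model = model.cuda()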


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
rvae = RVAE(parameters)
rvae = nn.DataParallel(rvae)
rvae.to(device)

Has anyone solved this? I am running into the same situation.

I think I have solved this problem.
If you wrap your model as model = DataParallel(model) and pass arguments into its forward(), then according to the PyTorch documentation:

Arbitrary positional and keyword inputs are allowed to be passed into DataParallel EXCEPT Tensors. All tensors will be scattered on dim specified (default 0). Primitive types will be broadcasted, but all other types will be a shallow copy and can be corrupted if written to in the model’s forward pass.

which means that if an input argument is a tensor, it is split along dim=0 (the batch dimension). For other types like a Python list/dict/str, DataParallel.forward() automatically copies it to N replicas (N being your number of GPUs).
The key point is that if you pass an argument like
[torch.tensor]
or
{"example": torch.tensor}
then, even though these are a Python list/dict, DataParallel.forward() does not handle the tensors inside them properly (and it won't raise an error). So the fix is simply to convert those arguments (and all their elements) to plain Python types.
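A minimal sketch of that workaround (the behavior described above is the previous poster's observation; the model and argument names here are purely illustrative): pass the extra value as a plain Python float instead of wrapping a tensor in a list or dict.

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 10)

    def forward(self, x, scale):
        # x is a tensor and gets scattered along dim 0;
        # scale is a plain Python float and is broadcast to every replica
        return self.fc(x) * scale

model = nn.DataParallel(Net()).cuda()
x = torch.randn(8, 10).cuda()

# pass the extra argument as a plain float rather than e.g. [torch.tensor(0.5)]
out = model(x, 0.5)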


What do you mean by Python types?