I have two GPUs and I am trying to use torch.nn.DataParallel to wrap my model so that it can run on both GPUs. I have done this successfully many times before, but this time something strange happened.
When the model is training, I print the input tensor inside the model’s forward function and only half of the batch is there. For example, with batch_size = 16, the forward only receives 8 samples during training.
When the model is in eval mode, I also print the input tensor inside forward. Sometimes it prints twice (each with a batch of 8), as expected.
Any ideas?
Here is my code:
...
device = torch.device("cuda:0")
model.to(device)
model = torch.nn.DataParallel(model, device_ids=[0, 1])
# Epochs
for _ in range(int(num_train_epochs)):
    for step, batch in enumerate(epoch_iterator):
        model.train()
        loss = model(**inputs)  # I compute the loss inside forward for balancing GPU memory.
        ...
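In case it helps, here is a minimal, self-contained sketch of how I print the shapes inside forward. The toy model, layer sizes, and random data are just placeholders for illustration, not my real code:

import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 1)

    def forward(self, x):
        # Under DataParallel each replica receives a slice of the batch,
        # so this prints the per-GPU chunk size, not the full batch size.
        print("forward sees:", x.shape, "on", x.device)
        return self.fc(x).mean()  # per-replica loss computed inside forward

device = torch.device("cuda:0")
model = ToyModel().to(device)
model = torch.nn.DataParallel(model, device_ids=[0, 1])

x = torch.randn(16, 10, device=device)
model.train()
loss = model(x)  # with 2 GPUs I expect this to print twice, each with shape [8, 10]

With two GPUs I would expect the print to fire twice per call, each with a batch of 8, but in training mode I only ever see one chunk of 8.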