3 GPUs using nn.DataParallel is much slower than 1 GPU

I am wondering why nn.DataParallel with 3 GPUs is slower than a single GPU. The batch size is 50. A single GPU takes 1 minute per epoch, while DataParallel takes 3 minutes per epoch.

The way I am using DataParallel is:

net = Net()
net = torch.nn.DataParallel(net, device_ids=[0, 1, 2]).cuda()

Is this the right way to use DataParallel? Thanks.


What kind of model are you using? DataParallel is much more effective with convnets than with complex recurrent architectures.

This is my model:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 20, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(20, 64, 5)
        self.fc1 = nn.Linear(64 * 5 * 5, 500)
        self.fc2 = nn.Linear(500, 200)
        self.fc3 = nn.Linear(200, 10)

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
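As a sanity check on the input size assumed by fc1 (64 * 5 * 5), the spatial dimensions can be traced through the layers. A minimal sketch, assuming 3x32x32 inputs (e.g. CIFAR-10-sized images, which is an assumption, not stated in the thread):

```python
# Trace spatial size through the network, assuming 32x32 input images.
def conv_out(size, kernel):
    # Conv2d with stride 1 and no padding shrinks each side by kernel - 1.
    return size - kernel + 1

def pool_out(size):
    # MaxPool2d(2, 2) halves each spatial side.
    return size // 2

s = 32
s = pool_out(conv_out(s, 5))  # conv1: 32 -> 28, pool: 28 -> 14
s = pool_out(conv_out(s, 5))  # conv2: 14 -> 10, pool: 10 -> 5
print(64 * s * s)  # 1600, matching nn.Linear(64 * 5 * 5, 500)
```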

The trade-off is parallelizing over 3 GPUs vs. giving each GPU enough work. With a batch size of 50 split across 3 GPUs, each GPU only sees a batch of ~16, which is likely underutilizing each GPU a lot. If you use a batch size of 50 per GPU (150 total), you might see an improvement in scaling.
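The split described above can be sketched with plain arithmetic. This is an illustration of near-equal chunking (how DataParallel-style scattering roughly divides a batch), not PyTorch API:

```python
# Sketch of how a global batch gets split across GPU replicas.
# With a global batch of 50 and 3 GPUs, each replica sees only ~16-17
# samples, which can leave each GPU underutilized.
def per_gpu_batch_sizes(global_batch, num_gpus):
    """Return near-equal chunk sizes, first replicas taking the remainder."""
    base = global_batch // num_gpus
    remainder = global_batch % num_gpus
    return [base + 1 if i < remainder else base for i in range(num_gpus)]

print(per_gpu_batch_sizes(50, 3))  # [17, 17, 16]
```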


I am facing exactly the same issue when I run my program with 3 GPUs. In this case, do you suggest increasing the batch size? For example, in my case I set the batch size to 30 and ran on 3 Titan X GPUs. The difference in running time is huge.

While running on 1 GPU, one complete epoch takes 30 minutes, but using DataParallel with 3 GPUs takes around 90 minutes. Do you have any suggestion to overcome this issue?


Has anyone found a solution here?


If you use batch_size=30 on a single GPU, then when you use DataParallel with 3 GPUs you should use batch_size=90 to make a fair comparison. The point of using DataParallel is that you can use a larger batch size, which then requires fewer iterations to complete one full epoch.
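The "fewer iterations" point can be sketched with simple arithmetic. The dataset size of 45,000 here is a hypothetical figure for illustration:

```python
import math

# A larger global batch means fewer iterations per epoch.
dataset_size = 45000  # hypothetical dataset size
iters_single = math.ceil(dataset_size / 30)    # batch_size=30 on 1 GPU
iters_parallel = math.ceil(dataset_size / 90)  # batch_size=90 across 3 GPUs

print(iters_single, iters_parallel)  # 1500 500
```

Each of the 500 iterations still processes 30 samples per GPU, so per-GPU utilization matches the single-GPU run.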


Then it gets an out-of-memory error.

No, if the 3 GPUs are similar to each other (they have the same amount of memory), then each one of them should be able to run with batch_size=30 independently. So basically, when you use batch_size=90, each one will run with batch_size=30.


Then, if 3 GPUs are used with batch size 30, shouldn't they be at least as fast as a single GPU with batch size 30? And I guess there is a possibility of out-of-memory, since we load the whole batch onto the first GPU and nn.DataParallel then divides it equally among all GPUs (out of memory when the whole batch is loaded onto a single GPU). Please correct me.