I am wondering why nn.DataParallel using 3 GPUs is slower than a single GPU. The batch size is 50. A single GPU takes 1 minute per epoch, while DataParallel takes 3 minutes per epoch.
The way I am using DataParallel is
import torch
net = Net()
net = torch.nn.DataParallel(net, device_ids=[0, 1, 2]).cuda()
The trade-off is between parallelizing over 3 GPUs and giving each GPU enough work. If you use a batch size of 50 per GPU, you might see an improvement in scaling.
The problem you have is that a batch size of 50 split across 3 GPUs gives each GPU only about 16 samples. This is likely underutilizing each GPU a lot.
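To see where the ~16 samples per GPU comes from: DataParallel scatters the input batch across devices using chunking (the same rule as torch.Tensor.chunk, where each chunk is ceil(batch / n_gpus) samples and the last chunk takes the remainder). A rough pure-Python sketch of that split:

```python
import math

def chunk_sizes(batch_size, n_gpus):
    """Mimic torch.Tensor.chunk along the batch dimension:
    each chunk holds ceil(batch_size / n_gpus) samples,
    and the last chunk gets whatever remains."""
    per_chunk = math.ceil(batch_size / n_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(per_chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes

print(chunk_sizes(50, 3))  # [17, 17, 16]
```

So with batch_size=50 on 3 GPUs, each device sees only 16 or 17 samples per step, which is usually too little work to hide the scatter/gather and kernel-launch overhead.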
I am facing exactly the same issue when I run my program with 3 GPUs. In this case, do you suggest increasing the batch size? For example, in my case I set the batch size to 30 and ran on 3 Titan X GPUs. The difference in running time is huge.
While running on 1 GPU, one complete epoch takes 30 minutes, but using DataParallel with 3 GPUs takes around 90 minutes. Do you have any suggestions for overcoming this issue?
If you use batch_size=30 on a single GPU, then when you use DataParallel with 3 GPUs, you should use batch_size=90 to make a fair comparison. The point of using DataParallel is that you can use a larger batch_size, which then requires fewer iterations to complete one full epoch.
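To make the "fewer iterations" point concrete, here is a small sketch (the dataset size of 9000 samples is just an assumed example, not from the thread):

```python
import math

def iterations_per_epoch(n_samples, batch_size):
    """Number of optimizer steps needed to see the whole dataset once."""
    return math.ceil(n_samples / batch_size)

n_samples = 9000  # hypothetical dataset size for illustration

# Single GPU with batch_size=30 vs. 3 GPUs with batch_size=90
print(iterations_per_epoch(n_samples, 30))  # 300 iterations per epoch
print(iterations_per_epoch(n_samples, 90))  # 100 iterations per epoch
```

With batch_size=90, each of the 3 GPUs still processes 30 samples per step, but the epoch finishes in a third of the iterations, which is where the speedup comes from.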
No. If the 3 GPUs are similar to each other (they have the same amount of memory), then each of them should be able to run with batch_size=30 independently. So when you use batch_size=90, each one will run with batch_size=30.
Then, if 3 GPUs are used with batch size 30, shouldn't they be at least as fast as a single GPU with batch size 30? And I guess there is a possibility of running out of memory, since the whole batch is first loaded onto the first GPU and nn.DataParallel() then divides it equally across all GPUs (out of memory when the whole batch is loaded onto a single GPU). Please correct me if I am wrong.