Multiple GPUs are not accelerating training much, and the profiling results are confusing

Dear PyTorch folks,

I am working on using multiple GPUs in PyTorch to accelerate my training. I used the standard recipe from https://github.com/kuangliu/pytorch-cifar

I used kernprof (line_profiler) to profile VGG19 training on CIFAR-10, for 1 epoch and for 3 epochs; the command I used is sketched right after the table. This is what I found (the full per-line output is shown at the end of this post):

            forward (s)     backward (s)    optimizer.step() (s)    Total (s)
1 Epoch Results
1 GPU       6.6             31.1            6.3                     68.8
2 GPU       20.9            19.2            4.1                     63.6
3 Epochs Results
1 GPU       12.2            83.5            19.0                    189
2 GPU       42.9            34.8            14.7                    156
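
For reference, I profiled roughly like this, decorating train() in main.py with @profile and running the script through line_profiler's kernprof wrapper (the flags below are the usual line_profiler invocation):

@profile              # picked up by kernprof; no import needed when launched through it
def train(epoch):
    ...

# launched from the shell as:
# kernprof -l -v main.py    (-l = line-by-line profiling, -v = print results when done)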

With 2 GPUs, DataParallel splits each batch of 128 across the devices (so each GPU effectively processes half the batch), and I therefore expected roughly a 2x speedup, right? This is the relevant part of https://github.com/kuangliu/pytorch-cifar/blob/master/main.py:

net = VGG('VGG19')
net = net.to(device)
if device == 'cuda':
    net = torch.nn.DataParallel(net)   # splits each input batch along dim 0 across all visible GPUs
...
trainloader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True, num_workers=2)
...
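
As a sanity check on how DataParallel distributes the batch, something like the tiny sketch below can be used (ShapeLogger is just a made-up stand-in module, not part of the recipe):

import torch
import torch.nn as nn

class ShapeLogger(nn.Module):
    # minimal stand-in model: just reports the chunk each replica receives
    def forward(self, x):
        print(x.device, 'received a chunk of shape', tuple(x.shape))
        return x

net = torch.nn.DataParallel(ShapeLogger()).cuda()
x = torch.randn(128, 3, 32, 32).cuda()
net(x)   # with 2 visible GPUs this should print two chunks of shape (64, 3, 32, 32)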

For the 1 epoch runs, the backward and optimizer.step() timings look reasonable: they don't halve, but they do improve by roughly 1.5x. What I do not understand is the forward pass: going from 1 GPU to 2 GPUs, the forward time increased about 3x, from 6.6 s to 20.9 s, instead of being cut in half. As a result the total epoch time barely improves (68.8 s vs. 63.6 s, only about 1.08x). Any thoughts on why this happens?
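
Since CUDA kernels launch asynchronously, I understand the per-line numbers above can be misleading: time tends to be attributed to whatever line forces a synchronization next (e.g. loss.item()). A minimal sketch of how the forward time could be re-checked with explicit synchronization (timed_forward is just a hypothetical helper name):

import time
import torch

def timed_forward(net, inputs):
    # synchronize before and after so the measured interval covers the actual
    # GPU work of this forward pass, not just the asynchronous kernel launches
    torch.cuda.synchronize()
    start = time.time()
    outputs = net(inputs)
    torch.cuda.synchronize()
    return outputs, time.time() - start

This would be called inside the training loop as outputs, fwd_s = timed_forward(net, inputs).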

P.S.: I saw @smth mention that when profiling on the GPU one should set CUDA_LAUNCH_BLOCKING=1. My understanding is that this serializes the kernel launches, so I should probably not set it when profiling the multi-GPU case. I did give it a try anyway: in the 2 GPU case the code runs, but only GPU 0 appears fully loaded while the other GPU idles, no intermediate results are printed, and the run never finishes, so I had to kill it. All the profiling above therefore uses the default CUDA_LAUNCH_BLOCKING=0.
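
For completeness, this is how the variable can be set when trying that (it has to be in the environment before CUDA is initialized, so the exact placement matters):

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'   # must be set before the first CUDA call
import torch                                # initialize CUDA only after the variable is set

# equivalently, from the shell:
# CUDA_LAUNCH_BLOCKING=1 kernprof -l -v main.py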

I am using PyTorch 0.4.0, CUDA 9.0, and K80 GPUs.

1 GPU 1 Epoch Result

Timer unit: 1e-06 s

Total time: 68.8406 s
File: main.py
Function: train at line 84

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    84                                           @profile
    85                                           def train(epoch):
    86         1         30.0     30.0      0.0      print('\nEpoch: %d' % epoch)
    87         1        796.0    796.0      0.0      net.train()
    88         1          2.0      2.0      0.0      train_loss = 0
    89         1          1.0      1.0      0.0      correct = 0
    90         1          1.0      1.0      0.0      total = 0
    91       392    1289897.0   3290.6      1.9      for batch_idx, (inputs, targets) in enumerate(trainloader):
    92       391     260068.0    665.1      0.4          inputs, targets = inputs.to(device), targets.to(device)
    93       391     168321.0    430.5      0.2          optimizer.zero_grad()
    94       391    6615408.0  16919.2      9.6          outputs = net(inputs)
    95       391      43107.0    110.2      0.1          loss = criterion(outputs, targets)
    96       391   31119349.0  79589.1     45.2          loss.backward()
    97       391    6332262.0  16195.0      9.2          optimizer.step()
    98
    99       391   22786630.0  58277.8     33.1          train_loss += loss.item()
   100       391      48457.0    123.9      0.1          _, predicted = outputs.max(1)
   101       391       1970.0      5.0      0.0          total += targets.size(0)
   102       391      61846.0    158.2      0.1          correct += predicted.eq(targets).sum().item()
   103
   104       391       8328.0     21.3      0.0          progress_bar(batch_idx, len(trainloader), 'Loss: %.3f | Acc: %.3f%% (%d/%d)'
   105       391     104154.0    266.4      0.2              % (train_loss/(batch_idx+1), 100.*correct/total, correct, total))

2 GPU 1 Epoch Result

Timer unit: 1e-06 s

Total time: 63.6235 s
File: main.py
Function: train at line 84

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    84                                           @profile
    85                                           def train(epoch):
    86         1         33.0     33.0      0.0      print('\nEpoch: %d' % epoch)
    87         1        709.0    709.0      0.0      net.train()
    88         1          1.0      1.0      0.0      train_loss = 0
    89         1          0.0      0.0      0.0      correct = 0
    90         1          1.0      1.0      0.0      total = 0
    91       392    1238905.0   3160.5      1.9      for batch_idx, (inputs, targets) in enumerate(trainloader):
    92       391     191505.0    489.8      0.3          inputs, targets = inputs.to(device), targets.to(device)
    93       391     166484.0    425.8      0.3          optimizer.zero_grad()
    94       391   20888816.0  53424.1     32.8          outputs = net(inputs)
    95       391      51159.0    130.8      0.1          loss = criterion(outputs, targets)
    96       391   19224818.0  49168.3     30.2          loss.backward()
    97       391    4098131.0  10481.2      6.4          optimizer.step()
    98
    99       391   17517974.0  44803.0     27.5          train_loss += loss.item()
   100       391      74839.0    191.4      0.1          _, predicted = outputs.max(1)
   101       391       1864.0      4.8      0.0          total += targets.size(0)
   102       391      59388.0    151.9      0.1          correct += predicted.eq(targets).sum().item()
   103
   104       391       7663.0     19.6      0.0          progress_bar(batch_idx, len(trainloader), 'Loss: %.3f | Acc: %.3f%% (%d/%d)'
   105       391     101227.0    258.9      0.2              % (train_loss/(batch_idx+1), 100.*correct/total, correct, total))