Dear PyTorch folks,
I am working on using multiple GPUs in PyTorch to accelerate my training, following the standard recipe from https://github.com/kuangliu/pytorch-cifar
I used kernprof to profile VGG19 training on CIFAR10 and found the following (the full profiler output for the 1-epoch runs is at the end of this post):
                 forward (s)   backward (s)   optimizer.step() (s)   Total (s)
1 Epoch
  1 GPU              6.6           31.1               6.3               68.8
  2 GPU             20.9           19.2               4.1               63.6
3 Epochs
  1 GPU             12.2           83.5              19.0              189
  2 GPU             42.9           34.8              14.7              156
The sample code doubles the effective batch size in the 2-GPU case, so I expected roughly a 2x speedup, right? This is the relevant code from https://github.com/kuangliu/pytorch-cifar/blob/master/main.py:
net = VGG19()
net = torch.nn.DataParallel(net)
...
trainloader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True, num_workers=2)
...
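For context, here is a minimal sketch of how I understand the full setup (the imports and the transform are placeholders for the repo's code, not exactly what it does): with 2 GPUs, DataParallel should replicate the model on both devices, split each 128-sample batch along dim 0 into two 64-sample chunks, run them in parallel, and gather the outputs back on GPU 0.

import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

from models import VGG  # the repo's VGG; treat this import as a placeholder

device = 'cuda' if torch.cuda.is_available() else 'cpu'

net = VGG('VGG19').to(device)
if device == 'cuda':
    # DataParallel replicates the module on every visible GPU, splits each
    # input batch along dim 0 (128 -> 64 + 64 on 2 GPUs), runs the replicas
    # in parallel, and gathers the outputs back onto GPU 0.
    net = nn.DataParallel(net)
    torch.backends.cudnn.benchmark = True

transform = transforms.Compose([transforms.ToTensor()])
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=128,
                                          shuffle=True, num_workers=2)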
For the 1-epoch case, the backward and optimizer.step() times look reasonable: they do not halve, but they do speed up by roughly 1.5x. What I do not understand is the forward pass: going from 1 GPU to 2 GPUs, the forward time increases about 3x, from 6.6 s to 20.9 s, instead of being cut roughly in half. Any thoughts on why that happens?
P.S.: I saw @smth mention that when profiling on the GPU, people should set CUDA_LAUNCH_BLOCKING=1 (my understanding is that this serializes kernel launches, so I should not set it when profiling the multi-GPU case). I did try it anyway: in the 2-GPU case the code runs, but only GPU 0 seems fully loaded while the other GPU idles, no intermediate results are printed, and the run never finishes, so I had to kill it. Because of that, I always profile with CUDA_LAUNCH_BLOCKING=0 (the default).
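If it is relevant: my understanding is that CUDA kernel launches are asynchronous, so without synchronization kernprof may attribute GPU time to whichever line happens to block (for example loss.item()). Here is a rough sketch of how I could instead time each phase with explicit torch.cuda.synchronize() calls and no CUDA_LAUNCH_BLOCKING (just an idea I have not run; the helper name timed_step is made up):

import time
import torch

def timed_step(net, criterion, optimizer, inputs, targets):
    """Sketch: time forward/backward/step with explicit synchronization,
    so asynchronous CUDA launches are not attributed to the wrong line."""
    times = {}
    optimizer.zero_grad()

    torch.cuda.synchronize()
    t0 = time.time()
    outputs = net(inputs)
    torch.cuda.synchronize()
    times['forward'] = time.time() - t0

    loss = criterion(outputs, targets)

    t0 = time.time()
    loss.backward()
    torch.cuda.synchronize()
    times['backward'] = time.time() - t0

    t0 = time.time()
    optimizer.step()
    torch.cuda.synchronize()
    times['step'] = time.time() - t0

    return loss, outputs, times

Calling this inside the training loop and accumulating the three entries of times over an epoch would give per-phase totals comparable to the kernprof numbers above.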
I am using PyTorch 0.4.0, CUDA 9.0, and K80 GPUs.
1 GPU 1 Epoch Result
Timer unit: 1e-06 s
Total time: 68.8406 s
File: main.py
Function: train at line 84
Line # Hits Time Per Hit % Time Line Contents
==============================================================
84 @profile
85 def train(epoch):
86 1 30.0 30.0 0.0 print('\nEpoch: %d' % epoch)
87 1 796.0 796.0 0.0 net.train()
88 1 2.0 2.0 0.0 train_loss = 0
89 1 1.0 1.0 0.0 correct = 0
90 1 1.0 1.0 0.0 total = 0
91 392 1289897.0 3290.6 1.9 for batch_idx, (inputs, targets) in enumerate(trainloader):
92 391 260068.0 665.1 0.4 inputs, targets = inputs.to(device), targets.to(device)
93 391 168321.0 430.5 0.2 optimizer.zero_grad()
94 391 6615408.0 16919.2 9.6 outputs = net(inputs)
95 391 43107.0 110.2 0.1 loss = criterion(outputs, targets)
96 391 31119349.0 79589.1 45.2 loss.backward()
97 391 6332262.0 16195.0 9.2 optimizer.step()
98
99 391 22786630.0 58277.8 33.1 train_loss += loss.item()
100 391 48457.0 123.9 0.1 _, predicted = outputs.max(1)
101 391 1970.0 5.0 0.0 total += targets.size(0)
102 391 61846.0 158.2 0.1 correct += predicted.eq(targets).sum().item()
103
104 391 8328.0 21.3 0.0 progress_bar(batch_idx, len(trainloader), 'Loss: %.3f | Acc: %.3f%% (%d/%d)'
105 391 104154.0 266.4 0.2 % (train_loss/(batch_idx+1), 100.*correct/total, correct, total))
2 GPU 1 Epoch Result
Timer unit: 1e-06 s
Total time: 63.6235 s
File: main.py
Function: train at line 84
Line # Hits Time Per Hit % Time Line Contents
==============================================================
84 @profile
85 def train(epoch):
86 1 33.0 33.0 0.0 print('\nEpoch: %d' % epoch)
87 1 709.0 709.0 0.0 net.train()
88 1 1.0 1.0 0.0 train_loss = 0
89 1 0.0 0.0 0.0 correct = 0
90 1 1.0 1.0 0.0 total = 0
91 392 1238905.0 3160.5 1.9 for batch_idx, (inputs, targets) in enumerate(trainloader):
92 391 191505.0 489.8 0.3 inputs, targets = inputs.to(device), targets.to(device)
93 391 166484.0 425.8 0.3 optimizer.zero_grad()
94 391 20888816.0 53424.1 32.8 outputs = net(inputs)
95 391 51159.0 130.8 0.1 loss = criterion(outputs, targets)
96 391 19224818.0 49168.3 30.2 loss.backward()
97 391 4098131.0 10481.2 6.4 optimizer.step()
98
99 391 17517974.0 44803.0 27.5 train_loss += loss.item()
100 391 74839.0 191.4 0.1 _, predicted = outputs.max(1)
101 391 1864.0 4.8 0.0 total += targets.size(0)
102 391 59388.0 151.9 0.1 correct += predicted.eq(targets).sum().item()
103
104 391 7663.0 19.6 0.0 progress_bar(batch_idx, len(trainloader), 'Loss: %.3f | Acc: %.3f%% (%d/%d)'
105 391 101227.0 258.9 0.2 % (train_loss/(batch_idx+1), 100.*correct/total, correct, total))