Better GPU for training a PyTorch CNN model, but it turns out to be even slower

I trained a small, simple CNN model for image classification using the same PyTorch code on two GPUs: a Colab Free K80 and a Paperspace Gradient P6000.

You can see both versions of the code here (I’ve printed the details of the GPU I used, to make sure):

For convenience, I show some parts of the code here:

The Model:

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=4, stride=1, padding=0)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=0)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=0)
        self.fc1 = nn.Linear(39200, 512)  # 39200 = 32 * 35 * 35 flattened conv features for 150x150 inputs
        self.fc2 = nn.Linear(512, 5)
        self.do = nn.Dropout()

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.reshape(x.shape[0], -1)  # flatten to (batch, 39200)
        x = self.do(F.relu(self.fc1(x)))
        x = self.fc2(x)
        return x
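
(A quick way to sanity-check where the 39200 in fc1 comes from, assuming the 3x150x150 images used later in the thread:)

import torch
import torch.nn.functional as F

net = Net()
x = torch.rand(1, 3, 150, 150)            # one dummy image at the dataset's resolution
feats = net.pool(F.relu(net.conv2(net.pool(F.relu(net.conv1(x))))))
print(feats.shape)                        # torch.Size([1, 32, 35, 35]) -> 32 * 35 * 35 = 39200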

The Training Loop:

import time
s = time.time()
 
model.train()
for i in range(epoch):
    total_loss = 0
    total_sample = 0
    total_correct = 0
    for image, label in trainloaders:
        image = image.to('cuda')
        label = label.to('cuda')
        out = model(image)
        loss = criterion(out, label) 
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item() 
        total_sample += len(label)
        total_correct += torch.sum(torch.max(out,1)[1]==label).item()*1.0
    print(f"epoch {i} loss:{total_loss/total_sample}  acc:{total_correct/total_sample}")

e = time.time() # TRAINING TIME
print(e-s)

As far as I know, the P6000 should perform better than the K80, but when I measure the training time using the code above, the K80 only needs ~110s to train the model for 20 epochs, while the P6000 needs ~140s (you can see the output in the code above).

I’ve run the code several times, restarted the kernel, and run it on another day, but it always shows a similar result. I’ve also tried using torch.cuda.synchronize(), but the result is still the same.
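
(For reference, a minimal sketch of how torch.cuda.synchronize() can bracket the timed region so the measurement includes all queued GPU work; the loop body itself is unchanged from the code above.)

torch.cuda.synchronize()  # finish any pending GPU work before starting the clock
s = time.time()

# ... training loop exactly as above ...

torch.cuda.synchronize()  # wait until the last backward()/step() has finished on the GPU
e = time.time()
print(e - s)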

I realize this only happens with my PyTorch code; when I use TensorFlow, the P6000 is much faster than the K80.

Why does this happen?

You could check the Performance Guide and enable e.g. torch.backends.cudnn.benchmark = True in case you are using static input shapes, so that cuDNN can profile the available kernels and select the fastest one for your use case.
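
For example, a rough sketch of where the flag and a short warm-up could go (assuming model, criterion, optimizer, and trainloaders are defined as in your script; the first iterations are slower while cuDNN benchmarks the available algorithms, so they shouldn't be part of the timed region):

import torch

torch.backends.cudnn.benchmark = True  # cuDNN times the available conv algorithms per input shape and caches the fastest

# warm-up: run a few batches so the algorithm search is not counted in the timing
model.train()
for i, (image, label) in enumerate(trainloaders):
    out = model(image.to('cuda'))
    criterion(out, label.to('cuda')).backward()
    if i == 2:
        break
optimizer.zero_grad()  # discard the warm-up gradients before the timed run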

Thank you for the reference! I just tried torch.backends.cudnn.benchmark = True and also removed the unnecessary .item() calls from the code. The results (I only ran it once):

Paperspace Paid P6000, CUDA 11.4:

  • with .item(), cudnn.benchmark=False: ~164s
  • without .item(), cudnn.benchmark=True: ~152s

Colab Free K80, CUDA 11.1:

  • with .item(), cudnn.benchmark=False: ~110s
  • without .item(), cudnn.benchmark=True: ~85s

I also tried running it on an RTX A6000 from JarvisLab.ai:

  • without .item(), cudnn.benchmark=True: ~111s

The K80 still outperforms the P6000 and the A6000.

Based on the results, it seems your runs are quite flaky, as the P6000’s runtime increased from the initial ~140s to ~164s. In any case, could you remove the data loading entirely, use static random inputs, and profile the models again, please? Since you are using different nodes, this would rule out other potential bottlenecks and could point to the kernel selection for the GPUs.
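
For example, a rough sketch of such a setup (it assumes model, criterion, and optimizer from your script, reuses the 3x150x150 input shape and batch size 64 from this thread, and uses torch.profiler for a per-kernel breakdown):

import torch
from torch.profiler import profile, ProfilerActivity

# static random batch kept on the GPU, so the data pipeline is completely out of the picture
image = torch.randn(64, 3, 150, 150, device='cuda')
label = torch.randint(0, 5, (64,), device='cuda')

model.train()
for _ in range(5):                      # warm-up (also lets cudnn.benchmark pick its algorithms)
    criterion(model(image), label).backward()
optimizer.zero_grad()                   # drop the warm-up gradients

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(20):
        optimizer.zero_grad()
        out = model(image)
        loss = criterion(out, label)
        loss.backward()
        optimizer.step()
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))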

Thank you for your help! I just tried using static inputs; I used the code below for the training loop:

dummy_set = torch.rand(940, 3, 150, 150)
dummy_set.requires_grad=True
dummy_lbl = torch.randint(low=0, high=5, size=(940,))

import time
torch.backends.cudnn.benchmark = True
torch.cuda.synchronize()
s = time.time()
 
model.train()
for i in range(epoch):
    total_loss = 0
    total_sample = 0
    total_correct = 0

    bst = 0  # batch index
    bsz = 64 # batch size
    ben = bst*bsz+bsz # batch end

    # note: the loop stops before the final partial batch, so every batch keeps the same shape
    while ben!=len(dummy_set):
        image, label = dummy_set[bst*bsz:ben], dummy_lbl[bst*bsz:ben]
        image = image.to('cuda')
        label = label.to('cuda')
        out = model(image) 
        loss = criterion(out, label)
        optimizer.zero_grad()
        loss.backward() 
        optimizer.step() 
        total_loss += loss  # kept as a tensor; calling .item() here would force a host-device sync
        total_sample += len(label)
        total_correct += torch.sum(torch.max(out,1)[1]==label)*1.0

        bst += 1 # increase batch
        ben = min(bst*bsz+bsz, len(dummy_set))

    print(f"epoch {i} loss:{total_loss/total_sample}  acc:{total_correct/total_sample}") 

torch.cuda.synchronize()
e = time.time()

And the results: the gap is now smaller, but the K80 is still faster (I ran it several times):

  • K80 : ~56s
  • P6000 : ~60s