Better GPU for training a PyTorch CNN model, but it turns out to be even slower

I trained a small, simple CNN model for image classification using the same PyTorch code on two GPUs: a Colab Free K80 and a Paperspace Gradient P6000.

You can see both versions of the code here (I’ve printed the details of the GPU I used, to make sure):

For convenience, I show some parts of the code here:

The Model:

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=4, stride=1, padding=0)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=0)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=0)
        self.fc1 = nn.Linear(39200, 512)  # 39200 = 32 * 35 * 35 flattened conv features for 150x150 inputs
        self.fc2 = nn.Linear(512, 5)
        self.do = nn.Dropout()

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.reshape(x.shape[0], -1)  # flatten to (batch, 39200)
        x = self.do(F.relu(self.fc1(x)))
        x = self.fc2(x)
        return x
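
(A quick way to sanity-check where the 39200 in fc1 comes from, assuming the 3x150x150 images used later in the thread:)

import torch
import torch.nn.functional as F

net = Net()
x = torch.rand(1, 3, 150, 150)            # one dummy image at the dataset's resolution
feats = net.pool(F.relu(net.conv2(net.pool(F.relu(net.conv1(x))))))
print(feats.shape)                        # torch.Size([1, 32, 35, 35]) -> 32 * 35 * 35 = 39200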

The Training Loop:

import time
s = time.time()
 
model.train()
for i in range(epoch):
    total_loss = 0
    total_sample = 0
    total_correct = 0
    for image, label in trainloaders:
        image = image.to('cuda')
        label = label.to('cuda')
        out = model(image)
        loss = criterion(out, label) 
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item() 
        total_sample += len(label)
        total_correct += torch.sum(torch.max(out,1)[1]==label).item()*1.0
    print(f"epoch {i} loss:{total_loss/total_sample}  acc:{total_correct/total_sample}")

e = time.time() # TRAINING TIME
print(e-s)

As far as I know, the P6000 should perform better than the K80, but when I measure the training time using the code above, the K80 only needs ~110s to train the model for 20 epochs, while the P6000 needs ~140s (you can see the output in the code above).

I’ve run the code several times, restarted the kernel, and run it on another day, but it always shows a similar result. I’ve also tried using torch.cuda.synchronize(), but the result is still the same.
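
(For reference, a minimal sketch of how torch.cuda.synchronize() can bracket the timed region so the measurement includes all queued GPU work; the loop body itself is unchanged from the code above.)

torch.cuda.synchronize()  # finish any pending GPU work before starting the clock
s = time.time()

# ... training loop exactly as above ...

torch.cuda.synchronize()  # wait until the last backward()/step() has finished on the GPU
e = time.time()
print(e - s)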

I realize this only happens with my PyTorch code; when I use TensorFlow, the P6000 is much faster than the K80.

Why does this happen?

You could check the Performance Guide and enable e.g. torch.backends.cudnn.benchmark = True in case you are using static input shapes, so that cuDNN can profile the available kernels and select the fastest one for your use case.
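
For example, a rough sketch of where the flag and a short warm-up could go (assuming model, criterion, optimizer, and trainloaders are defined as in your script; the first iterations are slower while cuDNN benchmarks the available algorithms, so they shouldn't be part of the timed region):

import torch

torch.backends.cudnn.benchmark = True  # cuDNN times the available conv algorithms per input shape and caches the fastest

# warm-up: run a few batches so the algorithm search is not counted in the timing
model.train()
for i, (image, label) in enumerate(trainloaders):
    out = model(image.to('cuda'))
    criterion(out, label.to('cuda')).backward()
    if i == 2:
        break
optimizer.zero_grad()  # discard the warm-up gradients before the timed run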

Thank you for the reference! I just tried torch.backends.cudnn.benchmark = True and also removed the unnecessary .item() calls from the code. The results (I only ran it once):

Paperspace Paid P6000, CUDA 11.4:

  • with .item(), cudnn.benchmark=False: ~164s
  • without .item(), cudnn.benchmark=True: ~152s

Colab Free K80, CUDA 11.1:

  • with .item(), cudnn.benchmark=False: ~110s
  • without .item(), cudnn.benchmark=True: ~85s

I also tried running it on an RTX A6000 from JarvisLab.ai:

  • without .item(), cudnn.benchmark=True: ~111s

The K80 still outperforms the P6000 and the A6000.

Based on the results, it seems your runs are quite flaky, as the P6000’s runtime increased from the initial ~140s to ~164s. In any case, could you remove the data loading entirely, use static random inputs, and profile the models again, please? Since you are using different nodes, this would rule out other potential bottlenecks and could point to the kernel selection for the GPUs.
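
For example, a rough sketch of such a setup (it assumes model, criterion, and optimizer from your script, reuses the 3x150x150 input shape and batch size 64 from this thread, and uses torch.profiler for a per-kernel breakdown):

import torch
from torch.profiler import profile, ProfilerActivity

# static random batch kept on the GPU, so the data pipeline is completely out of the picture
image = torch.randn(64, 3, 150, 150, device='cuda')
label = torch.randint(0, 5, (64,), device='cuda')

model.train()
for _ in range(5):                      # warm-up (also lets cudnn.benchmark pick its algorithms)
    criterion(model(image), label).backward()
optimizer.zero_grad()                   # drop the warm-up gradients

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(20):
        optimizer.zero_grad()
        out = model(image)
        loss = criterion(out, label)
        loss.backward()
        optimizer.step()
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))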

Thank you for your help! I just tried using static inputs; I used the code below for the training loop:

dummy_set = torch.rand(940, 3, 150, 150)
dummy_set.requires_grad=True
dummy_lbl = torch.randint(low=0, high=5, size=(940,))

import time
torch.backends.cudnn.benchmark = True
torch.cuda.synchronize()
s = time.time()
 
model.train()
for i in range(epoch):
    total_loss = 0
    total_sample = 0
    total_correct = 0

    bst = 0  # batch index
    bsz = 64 # batch size
    ben = bst*bsz+bsz # batch end

    # note: the loop stops before the final partial batch, so every batch keeps the same shape
    while ben!=len(dummy_set):
        image, label = dummy_set[bst*bsz:ben], dummy_lbl[bst*bsz:ben]
        image = image.to('cuda')
        label = label.to('cuda')
        out = model(image) 
        loss = criterion(out, label)
        optimizer.zero_grad()
        loss.backward() 
        optimizer.step() 
        total_loss += loss  # kept as a tensor; calling .item() here would force a host-device sync
        total_sample += len(label)
        total_correct += torch.sum(torch.max(out,1)[1]==label)*1.0

        bst += 1 # increase batch
        ben = min(bst*bsz+bsz, len(dummy_set))

    print(f"epoch {i} loss:{total_loss/total_sample}  acc:{total_correct/total_sample}") 

torch.cuda.synchronize()
e = time.time()

And the results: the gap is now smaller, but the K80 is still faster (I ran it several times):

  • K80 : ~56s
  • P6000 : ~60s