Why does my code run so slowly on the test set?

The code is here:

    import numpy as np
    import torch
    import torch.nn as nn
    import albumentations as A
    from tqdm import tqdm

    # resnet34_aspp, VOCSeg, and ConfusionMatrix are defined elsewhere in my project
    device = torch.device('cuda')
    num_classes = 21

    model = resnet34_aspp()
    model.load_state_dict(torch.load('./resnet34-aspp.pth'))

    train_transform = A.Compose([
        A.HorizontalFlip(p=0.5),
        A.VerticalFlip(p=0.5),
        A.RandomRotate90(p=0.5),
        A.Resize(512, 512),
    ])
    test_transform = A.Compose([
        A.Resize(512, 512)
    ])

    datapath = '/root/autodl-tmp/data/VOC'

    trainset = VOCSeg(root=datapath, year='2012', image_set='train', download=False, transforms=train_transform)
    trainloader = torch.utils.data.DataLoader(
        dataset=trainset,
        shuffle=True,
        batch_size=80,
        num_workers=12,
        pin_memory=True
    )
    testset = VOCSeg(root=datapath, year='2012', image_set='val', download=False, transforms=test_transform)
    testloader = torch.utils.data.DataLoader(
        dataset=testset,
        shuffle=False,
        batch_size=32,
        num_workers=12,
        pin_memory=True
    )
    optimizer = torch.optim.SGD(model.parameters(),
                                lr=1e-4,
                                momentum=0.9,
                                weight_decay=5e-4
                                )
    model.train()
    model.to(device)
    loss_func = nn.CrossEntropyLoss()
    confm = ConfusionMatrix(21)
    for epoch in tqdm(range(1, 201)):
        loss_mean = []
        model.train()
        for inputs, labels in trainloader:
            inputs, labels = inputs.to(device), labels.long().to(device)
            outputs = model(inputs)
            loss = loss_func(outputs, labels)
            loss_mean.append(loss.item())
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        print(f'epoch{epoch}:{np.sum(loss_mean) / len(loss_mean)}')

        if epoch % 20 == 0:
            with torch.no_grad():
                model.eval()
                print('test')
                for inputs, labels in tqdm(testloader):
                    inputs, labels = inputs.to(device), labels.to(device)
                    outputs = model(inputs)
                    confm.update(labels.flatten(), outputs.argmax(1).flatten())
                s_acc_global, s_acc, s_iou, s_f1 = confm.compute(with_background=False)
                s_acc, s_iou, s_f1 = s_acc.mean().item(), s_iou.mean().item(), s_f1.mean().item()
                print(f's_acc_global:{s_acc_global} s_iou:{s_iou} s_f1:{s_f1}')
            confm.reset()
        torch.save(model.state_dict(), 'resnet34-aspp.pth')

It takes 153 seconds to finish an epoch on the train set, but almost an hour to finish an epoch on the test set (the train set contains about 15000 pictures and the test set about 1600). I noticed that the GPU memory usage is 24186MiB / 24576MiB while the model is training, so I guess there is some GPU memory exchange going on during the test phase, but I'm not sure that alone could explain such a long runtime.

Profile your code and try to narrow down where the bottleneck is; e.g. it's unclear how expensive confm.compute(with_background=False) is, etc.
Alternatively, remove additional code executed in the validation loop until you (almost) match the training code and check if this speeds things up.
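
As a rough sketch (reusing model, testloader, confm, and device from your snippet), you could time each stage of the validation loop with explicit torch.cuda.synchronize() calls; CUDA operations run asynchronously, so a naive timer can otherwise attribute the cost to the wrong line:

    import time

    # Rough per-stage timing of the validation loop. The synchronize() calls make
    # sure pending GPU work is finished before each timestamp is taken.
    model.eval()
    data_time = copy_time = fwd_time = metric_time = 0.0
    with torch.no_grad():
        t0 = time.time()
        for inputs, labels in testloader:
            t1 = time.time()                      # time spent waiting on the DataLoader
            inputs, labels = inputs.to(device), labels.to(device)
            torch.cuda.synchronize()              # host-to-device copy finished
            t2 = time.time()
            outputs = model(inputs)
            torch.cuda.synchronize()              # forward pass finished
            t3 = time.time()
            confm.update(labels.flatten(), outputs.argmax(1).flatten())
            torch.cuda.synchronize()              # metric update finished
            t4 = time.time()
            data_time += t1 - t0
            copy_time += t2 - t1
            fwd_time += t3 - t2
            metric_time += t4 - t3
            t0 = time.time()
    print(f'data: {data_time:.1f}s  copy: {copy_time:.1f}s  '
          f'forward: {fwd_time:.1f}s  metric: {metric_time:.1f}s')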

Hey @ptrblck, here is the confusion matrix code; I think it is normal.

    class ConfusionMatrix(object):
        def __init__(self, num_classes):
            self.num_classes = num_classes
            self.mat = None

        def update(self, a, b):
            # a: ground-truth labels, b: predictions (both flattened 1-D tensors)
            n = self.num_classes
            if self.mat is None:
                self.mat = torch.zeros((n, n), dtype=torch.int64, device=a.device)
            with torch.no_grad():
                k = (a >= 0) & (a < n)  # drop labels outside [0, n), e.g. the 255 void label in VOC
                inds = n * a[k].to(torch.int64) + b[k]
                self.mat += torch.bincount(inds, minlength=n ** 2).reshape(n, n)

        def reset(self):
            if self.mat is not None:
                self.mat.zero_()

        def compute(self, with_background=False):
            # rows of self.mat are ground-truth classes, columns are predicted classes
            h = self.mat.float()
            acc_global = torch.diag(h).sum() / h.sum()
            acc = torch.diag(h) / h.sum(1)
            iu = torch.diag(h) / (h.sum(1) + h.sum(0) - torch.diag(h))
            if with_background is False:
                iu = iu[1:]  # drop the background class from the IoU
            recall = torch.diag(h) / h.sum(0)
            f1 = 2 * acc * recall / (acc + recall)
            return acc_global, acc, iu, f1
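
Just to show what it produces, here is a quick toy check of update/compute on hand-made tensors (purely illustrative, not from my actual run):

    # Toy sanity check with 3 classes: 4 of the 6 "pixels" are predicted correctly.
    cm = ConfusionMatrix(3)
    targets = torch.tensor([0, 0, 1, 1, 2, 2])
    preds = torch.tensor([0, 1, 1, 1, 2, 0])
    cm.update(targets, preds)
    acc_global, acc, iou, f1 = cm.compute(with_background=False)
    print(cm.mat)       # 3x3 count matrix, rows = ground truth, columns = predictions
    print(acc_global)   # 4/6 overall pixel accuracy
    print(iou)          # per-class IoU with class 0 (background) dropped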

I debugged my code and finally figured it out. This line is the problem:

    for inputs, labels in tqdm(testloader):
        inputs, labels = inputs.to(device), labels.to(device) # this line

The “.to(device)” call costs almost 95% of the time in the test phase.
Moreover, I tried changing the trainloader batch size from 80 to 48, which leaves enough GPU memory for the testloader, and this solves the problem. Another fix is changing the testloader batch size to 10 (with the trainloader staying at 80). I guess the problem is related to how PyTorch handles GPU memory exchange, or to something on the CUDA side. By the way, my PyTorch version is 1.11.0+cu113.
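
For reference, here is a rough sketch of how the caching allocator could be inspected between the training and the test phase (print_cuda_mem is just an illustrative helper): torch.cuda.memory_allocated() counts live tensors, while torch.cuda.memory_reserved() also includes blocks the allocator keeps cached from the large training batches.

    # Sketch: inspect the CUDA caching allocator before the test loop starts.
    def print_cuda_mem(tag):
        alloc = torch.cuda.memory_allocated() / 1024 ** 2     # live tensors
        reserved = torch.cuda.memory_reserved() / 1024 ** 2   # live tensors + cached blocks
        print(f'{tag}: allocated={alloc:.0f}MiB, reserved={reserved:.0f}MiB')

    print_cuda_mem('after training epoch')
    torch.cuda.empty_cache()   # return cached, currently unused blocks to the driver
    print_cuda_mem('after empty_cache()')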

PyTorch does not offload GPU memory for you, so I don't understand how memory could apparently be migrated behind your back.