Single Conv2D layer has larger footprint than resnet50

While investigating another issue, I stumbled over some weird behaviour: using a single Conv2d layer for inference blows up my memory usage more than using a resnet50. This is the code that I’m using:

import os

from tqdm import tqdm
from torchvision.transforms import ToTensor
from torchvision.models import resnet50, resnet18
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
import torch
import torch.nn as nn
import cv2

n = 1000000

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

class RandomDs(Dataset):
    def __init__(self):
        pass

    def __len__(self):
        return n

    def __getitem__(self, index):
        return torch.rand(3, 256, 256)

if __name__ == '__main__':

    dataset = RandomDs()
    data_loader = DataLoader(dataset, batch_size=128, shuffle=False, num_workers=4, pin_memory=False)

    model = resnet50()
    # model = nn.Conv2d(3, 128, 3)

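    os.environ["CUDA_VISIBLE_DEVICES"] = "0"
    device = torch.device('cuda:0')
    # model = nn.DataParallel(model)
    # Assumed step: the model has to live on the same device as the batches.
    model = model.to(device)

    s = f"{count_parameters(model):.3e} trainable parameters"
    print(s)

    with torch.no_grad():
        for batch in tqdm(data_loader):
            # Assumed completion of the truncated line: move the batch to the GPU, then run inference.
            batch = batch.to(device)
            model(batch)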

I ran this code in 4 configurations: the model is either resnet50 or nn.Conv2d(3, 128, 3), and DataParallel is either on or off (off means I just use a single GPU).

For Conv2D, single GPU I get

For Conv2D, DataParallel I get

For resnet50, single GPU I get

For resnet50, DataParallel I get

Why is that? Conv2d uses far more memory than resnet50, and its memory usage increases further when I use DataParallel.

I am using CUDA 10.2 and PyTorch 1.2.0.


Keep in mind that PyTorch won’t release memory after using it; it keeps it cached to make follow-up allocations faster.
So this does not necessarily show the memory that is needed overall, but rather the maximum that was allocated at some point.
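
To separate what the tensors actually occupy from what the caching allocator is holding on to, one could compare the counters PyTorch exposes. This is a minimal sketch, assuming a recent PyTorch release (`torch.cuda.memory_reserved` was called `memory_cached` in older versions such as 1.2.0); the helper name `report_cuda_memory` is made up for illustration.

```python
import torch
import torch.nn as nn


def report_cuda_memory(model, batch, device="cuda:0"):
    """Run one forward pass and return CUDA memory counters in MiB.

    Hypothetical helper for illustration; returns None when no GPU is present.
    """
    if not torch.cuda.is_available():
        return None
    model = model.to(device).eval()
    batch = batch.to(device)
    # Start the peak counter from the current state.
    torch.cuda.reset_peak_memory_stats(device)
    with torch.no_grad():
        model(batch)
    torch.cuda.synchronize(device)
    mib = 1024 ** 2
    return {
        # Memory occupied by tensors that are still alive.
        "allocated": torch.cuda.memory_allocated(device) / mib,
        # Highest tensor memory seen since the reset above.
        "peak_allocated": torch.cuda.max_memory_allocated(device) / mib,
        # Memory the caching allocator holds, whether in use or not.
        "reserved": torch.cuda.memory_reserved(device) / mib,
    }


if __name__ == "__main__":
    stats = report_cuda_memory(nn.Conv2d(3, 128, 3), torch.rand(8, 3, 256, 256))
    print(stats)
```

If `reserved` is much larger than `peak_allocated`, most of what an external tool reports is cache rather than memory the model needs.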

Also, you might want to try setting torch.backends.cudnn.benchmark = True to allow cuDNN to find good kernels.
I am sure cuDNN is heavily optimized for resnet’s convolutions :wink:

Hmm, OK, but we are talking here about a 3x3 convolution with 3 input and 128 output channels. That is fewer than 4,000 parameters, compared to more than 10,000,000 for resnet50. For a similar operation it’s allocating twice the amount of memory, while the model has less than 0.04% of the parameters?

Depending on the algorithm used to perform the convolution, the intermediate results can be very (very) big. In particular, if it is done as a matrix multiplication, the unfolded tensors will be huge here.
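
To put rough numbers on this, here is a back-of-the-envelope sketch for the layer from the question (batch of 128, 3x256x256 float32 inputs, `nn.Conv2d(3, 128, 3)`). The function name `conv2d_footprint` is hypothetical, and the "unfolded" figure assumes a plain im2col-style matrix multiplication, which is only one of the algorithms cuDNN may choose.

```python
def conv2d_footprint(batch, c_in, c_out, k, h, w, bytes_per_elem=4):
    """Sizes in bytes for a stride-1, unpadded k x k convolution (float32 by default)."""
    h_out, w_out = h - k + 1, w - k + 1
    # Weights plus one bias per output channel.
    weights = (c_in * k * k * c_out + c_out) * bytes_per_elem
    # im2col / unfold buffer: one column of c_in*k*k values per output pixel.
    unfolded = batch * c_in * k * k * h_out * w_out * bytes_per_elem
    # Output activation: c_out channels at (almost) full input resolution.
    output = batch * c_out * h_out * w_out * bytes_per_elem
    return {"weights": weights, "unfolded": unfolded, "output": output}


sizes = conv2d_footprint(batch=128, c_in=3, c_out=128, k=3, h=256, w=256)
for name, nbytes in sizes.items():
    print(f"{name:>8}: {nbytes / 1024**2:10.2f} MiB")
```

Under these assumptions the weights take about 14 KiB, the unfolded input about 851 MiB, and the output activation alone about 3.9 GiB per batch, so the parameter count says very little about the memory a layer needs at inference time.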

The reason I recommend benchmark mode is that it will pick the algorithm based on the actual inputs instead of simply using the default one. This should help you get lower memory usage (and better speed once the first call has run the benchmark).
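
Turning it on is a single flag, set before the first forward pass. A minimal sketch, assuming the same layer as in the question; the small batch and the CPU fallback are only there so the snippet runs anywhere.

```python
import torch
import torch.nn as nn

# Let cuDNN time the available convolution algorithms for each new input shape
# and reuse the fastest one on later calls.
torch.backends.cudnn.benchmark = True

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = nn.Conv2d(3, 128, 3).to(device).eval()

with torch.no_grad():
    # Small batch so the sketch also runs on CPU, where the flag has no effect.
    out = model(torch.rand(2, 3, 256, 256, device=device))

print(out.shape)  # torch.Size([2, 128, 254, 254])
```

Note that cuDNN selects by speed, so whether memory usage actually drops depends on which algorithm wins for the given input shape; it is worth measuring before and after.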