CUDA out of memory error when training on ImageNet

I am testing a repository on the ImageNet dataset that was actually designed for small datasets.
I am training on ILSVRC 2012 (1.2 million training images).

I tried with batch size 64, and also with 32 and 128.
I also ran my experiment with both ResNet18 and ResNet50.
I tried a bigger GPU machine with 128 GB RAM, and one with 256 GB RAM.
I am only doing image classification with the Random method.

CUDA_VISIBLE_DEVICES = 0
NUM_TRAIN = 120000
BATCH = 64
SUBSET = 100000
ADDENDUM = 2500

    elif dataset == 'imagenet':
        test_transform = T.Compose([
            T.ToTensor(),
            T.CenterCrop((224, 224)),
            T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        ])

        data_train = ImageFolder(root='./data/toy_dataset/train', transform=train_transform)  # todo: change back to train
        data_test = ImageFolder(root='./data/toy_dataset/train', transform=test_transform)

        data_unlabeled = MyDataset(dataset, True, test_transform)
        NO_CLASSES = 200
        adden = 100
        no_train = NUM_TRAIN

    return data_train, data_unlabeled, data_test, adden, NO_CLASSES, no_train

if method == 'Random':
    arg = np.random.randint(SUBSET, size=SUBSET)  # note: randint samples indices with replacement

Could you post a small script that reproduces the issue, please? I'm also curious what the exact hardware setup is, as I'm not familiar with any GPU with 256 GiB or even 128 GiB of RAM.

I would consider decreasing the batch size further (though keep in mind at very small sizes normalizations like batchnorm might misbehave).
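If even a small batch still runs out of memory, gradient accumulation trades compute for memory while keeping the effective batch size. A minimal sketch, not from the repository (the model, dataloader, criterion, and optimizer names are placeholders):

import torch

def train_with_accumulation(model, dataloader, criterion, optimizer, accum_steps=4):
    # Micro-batches of size B behave like one optimizer step at size B * accum_steps,
    # but only one micro-batch's activations live on the GPU at a time.
    model.train()
    optimizer.zero_grad()
    for i, (inputs, labels) in enumerate(dataloader):
        inputs, labels = inputs.cuda(), labels.cuda()
        loss = criterion(model(inputs), labels) / accum_steps  # scale so accumulated grads average
        loss.backward()                                        # gradients accumulate in param.grad
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()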

Traceback (most recent call last):
  File "main.py", line 151, in <module>
    train(models, method, criterion, optimizers, schedulers, dataloaders, args.no_of_epochs, EPOCHL)
  File "/po1/kanza.ali/workspace/OtherDataSets/ImageNet/train_test.py", line 112, in train
    loss = train_epoch(models, method, criterion, optimizers, dataloaders, epoch, epoch_loss)
  File "/po1/kanza.ali/workspace/OtherDataSets/ImageNet/train_test.py", line 78, in train_epoch
    scores, _, features = models(inputs)
  File "/home/kanza.ali/anaconda3/envs/optuna/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/kanza.ali/anaconda3/envs/optuna/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/kanza.ali/anaconda3/envs/optuna/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/po1/kanza.ali/workspace/OtherDataSets/ImageNet/models/resnet.py", line 113, in forward
    out2 = self.layer2(out1)
  File "/home/kanza.ali/anaconda3/envs/optuna/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/kanza.ali/anaconda3/envs/optuna/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/home/kanza.ali/anaconda3/envs/optuna/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/po1/kanza.ali/workspace/OtherDataSets/ImageNet/models/resnet.py", line 31, in forward
    out += self.shortcut(x)
  File "/home/kanza.ali/anaconda3/envs/optuna/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/kanza.ali/anaconda3/envs/optuna/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/home/kanza.ali/anaconda3/envs/optuna/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/kanza.ali/anaconda3/envs/optuna/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 446, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/kanza.ali/anaconda3/envs/optuna/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 442, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: CUDA out of memory. Tried to allocate 392.00 MiB (GPU 0; 10.76 GiB total capacity; 9.27 GiB already allocated; 129.44 MiB fre$

The image sizes differ across the ImageNet dataset, so I apply data augmentation and crop to (224, 224). When I use batch size 32, it gives me the following error:

RuntimeError: mat1 and mat2 shapes cannot be multiplied (32x25088 and 512x1000)

What is the exact model architecture in this case? Something seems incorrect, as the batch size shouldn't affect shape compatibility in matmuls in a classification model. (The 32 in mat1 is just the batch dimension; the real conflict is 25088 features per sample against the 512 input features the linear layer expects, and 25088 = 512 × 7 × 7, which suggests a 7×7 feature map is being flattened.)

Are you talking about case 2 with batch size 32? If yes, I am using ResNet32.

Could you post the code defining the model? I’m not sure it is expected that a reference implementation would encounter a shape mismatch depending on the batch size.

import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    expansion = 1

    def __init__(self, in_planes, planes, stride=1):
        super(BasicBlock, self).__init__()
        self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=3, stride=stride, padding=1, bias=False)
        #print(self.conv1)
        self.bn1 = nn.BatchNorm2d(planes)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=1, padding=1, bias=False)
        #print(self.conv2)
        self.bn2 = nn.BatchNorm2d(planes)

        self.shortcut = nn.Sequential()
        if stride != 1 or in_planes != self.expansion*planes:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_planes, self.expansion*planes, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(self.expansion*planes)
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += self.shortcut(x)
        out = F.relu(out)
        #print(out)
        return out

class ResNet(nn.Module):
    def __init__(self, block, num_blocks, num_classes=10):    #200 Imagenet Classes 
        super(ResNet, self).__init__()
        self.in_planes = 64

        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.layer1 = self._make_layer(block, 64, num_blocks[0], stride=1)
        self.layer2 = self._make_layer(block, 128, num_blocks[1], stride=2)
        self.layer3 = self._make_layer(block, 256, num_blocks[2], stride=2)
        self.layer4 = self._make_layer(block, 512, num_blocks[3], stride=2)
        self.linear = nn.Linear(512*block.expansion, num_classes)
        #self.linear2 = nn.Linear(1000, num_classes)

    def _make_layer(self, block, planes, num_blocks, stride):
        strides = [stride] + [1]*(num_blocks-1)
        layers = []
        for stride in strides:
            layers.append(block(self.in_planes, planes, stride))
            self.in_planes = planes * block.expansion
        return nn.Sequential(*layers)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out1 = self.layer1(out)
        out2 = self.layer2(out1)
        out3 = self.layer3(out2)
        out4 = self.layer4(out3)
        out = F.avg_pool2d(out4, 4)       # fixed 4x4 window: a 224x224 input reaches here as 28x28, leaving 7x7
        outf = out.view(out.size(0), -1)  # flattens to 512*7*7 = 25088 features on 224x224 inputs
        # outl = self.linear(outf)
        out = self.linear(outf)
        return out, outf, [out1, out2, out3, out4]

def ResNet18(num_classes=10):  # 200 ImageNet classes
    return ResNet(BasicBlock, [2,2,2,2], num_classes)

Where is the definition of ResNet?

Can you please check the above code now?

The use of fixed-size average pooling here looks strange, as it is not shape-agnostic. For example, check the reference implementation in torchvision, which uses adaptive average pooling to guarantee an output spatial shape of 1×1: torchvision.models.resnet — Torchvision 0.12 documentation
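To make the mismatch concrete: with the forward pass above, a 224×224 input reaches layer4 as a 28×28 map, and F.avg_pool2d(out4, 4) only reduces that to 7×7, so the flatten produces 512 × 7 × 7 = 25088 features while self.linear expects 512. That is exactly the 25088-vs-512 conflict in the reported error. A minimal sketch of a shape-agnostic forward pass, following torchvision's use of adaptive pooling (intended as a drop-in replacement for ResNet.forward above):

import torch.nn.functional as F

# adaptive_avg_pool2d(x, 1) always produces a 1x1 spatial map, so the flatten
# yields 512 features per sample for any input resolution (32x32 or 224x224).
def forward(self, x):
    out = F.relu(self.bn1(self.conv1(x)))
    out1 = self.layer1(out)
    out2 = self.layer2(out1)
    out3 = self.layer3(out2)
    out4 = self.layer4(out3)
    out = F.adaptive_avg_pool2d(out4, 1)  # (N, 512, 1, 1) regardless of H, W
    outf = out.view(out.size(0), -1)      # (N, 512), matching nn.Linear(512, ...)
    out = self.linear(outf)
    return out, outf, [out1, out2, out3, out4]

Separately, note that this CIFAR-style stem (a 3×3, stride-1 conv with no max-pool) keeps the early layers at full 224×224 resolution, roughly 49× the activation footprint of the 32×32 inputs the repository was designed for, which likely contributes to the out-of-memory error as well.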

What batch size are you using for training your model?

I have tried with batch sizes 64 and 128; both give the same error.

Try reducing the batch size; maybe the input data is consuming a lot of memory.
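One way to check that hypothesis is to query the CUDA allocator directly after a forward/backward pass, for example:

import torch

print(f"{torch.cuda.memory_allocated() / 2**20:.1f} MiB currently allocated")
print(f"{torch.cuda.max_memory_allocated() / 2**20:.1f} MiB at peak")
print(torch.cuda.memory_summary())  # detailed allocator breakdown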

Thank you so much for your response. I resolved the issue by using DataParallel.
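For readers hitting the same issue, a minimal sketch of the DataParallel approach, assuming the ResNet18 constructor from the code above and at least two visible GPUs:

import torch
import torch.nn as nn

model = ResNet18(num_classes=200)
if torch.cuda.device_count() > 1:
    # Replicates the model on each listed GPU; every forward pass splits the
    # batch across them, so per-GPU activation memory drops accordingly.
    model = nn.DataParallel(model, device_ids=[0, 1])
model = model.cuda()

Note that the CUDA_VISIBLE_DEVICES = 0 setting near the top of the thread would need to expose more than one device (e.g. 0,1) for this to help.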