Running out of memory on both CPU and GPU, can't figure out what I'm doing wrong

equ1 · February 15, 2021, 6:52am

I’m using PyTorch Lightning. Here’s the model definition:

import torch
from torch import nn
import pytorch_lightning as pl  # needed for pl.LightningModule and pl.metrics below

# creates network class
class Net(pl.LightningModule):
    def __init__(self):
        super().__init__()

        # defines conv layers
        self.conv_layer_b1 = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=32,
                      kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Flatten(),
        )

        # passes dummy x matrix to find the input size of the fc layer
        x = torch.randn(1, 3, 800, 600)
        self._to_linear = None
        self.forward(x)

        # defines fc layer
        self.fc_layer = nn.Sequential(
            nn.Linear(in_features=self._to_linear,
                      out_features=256),
            nn.ReLU(),
            nn.Linear(256, 5),
        )

        # defines accuracy metric
        self.accuracy = pl.metrics.Accuracy()
        self.confusion_matrix = pl.metrics.ConfusionMatrix(num_classes=5)

    def forward(self, x):
        x = self.conv_layer_b1(x)

        if self._to_linear is None:
            # does not run fc layer if input size is not determined yet
            self._to_linear = x.shape[1]
        else:
            x = self.fc_layer(x)
        return x

    def cross_entropy_loss(self, logits, y):
        criterion = nn.CrossEntropyLoss()
        
        return criterion(logits, y)

    def training_step(self, train_batch, batch_idx):
        x, y = train_batch
        logits = self.forward(x)

        train_loss = self.cross_entropy_loss(logits, y)
        train_acc = self.accuracy(logits, y)
        train_cm = self.confusion_matrix(logits, y)

        self.log('train_loss', train_loss)
        self.log('train_acc', train_acc)
        self.log('train_cm', train_cm)

        return train_loss

    def validation_step(self, val_batch, batch_idx):
        x, y = val_batch
        logits = self.forward(x)

        val_loss = self.cross_entropy_loss(logits, y)
        val_acc = self.accuracy(logits, y)

        return {'val_loss': val_loss, 'val_acc': val_acc}

    def validation_epoch_end(self, outputs):
        avg_val_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        avg_val_acc = torch.stack([x['val_acc'] for x in outputs]).mean()

        self.log("val_loss", avg_val_loss)
        self.log("val_acc", avg_val_acc)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=0.0008)

        return optimizer

The issue is probably not the machine, since I’m using a cloud instance with 60 GB of RAM and 12 GB of VRAM. Whenever I run this model, even for a single epoch, I get an out-of-memory error. On the CPU it looks like this:

RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 1966080000 bytes. Error code 12 (Cannot allocate memory)

and on the GPU it looks like this:

RuntimeError: CUDA out of memory. Tried to allocate 7.32 GiB (GPU 0; 11.17 GiB total capacity; 4.00 KiB already allocated; 2.56 GiB free; 2.00 MiB reserved in total by PyTorch)

Clearing the cache and reducing the batch size did not work. I’m a novice, so clearly something here is exploding, but I can’t tell what. Any help would be appreciated.

Thank you!

ptrblck · February 15, 2021, 10:24am

I cannot see any obvious issues in the code, but I’m also not deeply familiar with Lightning.
Are you seeing the same issue using a plain PyTorch model?

equ1 · February 16, 2021, 6:33am

Thanks for that, it helped a lot to know that the code is not fundamentally wrong. I figured out that the issue is simply that there are not enough conv-relu-pool layers to reduce the feature map to a reasonable size… the model had roughly 900 million parameters haha.
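
For anyone landing here later, here is a rough sketch of the arithmetic behind that number. It assumes every block is the same as the one in my model (3x3 conv with padding=1, which keeps height and width, followed by a 2x2 max pool with stride 2, which halves them and rounds down). The helper names `flattened_features` and `linear_params` are made up for this example, and the five-block variant with 32 channels throughout is only an illustration, not the architecture I ended up with.

```python
def flattened_features(height, width, out_channels, num_blocks):
    """Size of the flattened feature map after num_blocks conv-relu-pool blocks.

    Assumes conv(kernel_size=3, padding=1) keeps H and W, and
    MaxPool2d(kernel_size=2, stride=2) halves them (floor division).
    """
    for _ in range(num_blocks):
        height //= 2
        width //= 2
    return out_channels * height * width


def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


# The model from the first post: one block on an 800x600 dummy input.
one_block = flattened_features(800, 600, out_channels=32, num_blocks=1)
first_fc = linear_params(one_block, 256)
print(one_block)            # 3840000 input features to the fc layer
print(first_fc)             # 983040256 parameters in the first Linear alone
print(first_fc * 4 / 1e9)   # float32 weights in GB, before gradients and optimizer state

# Hypothetical variant: five blocks, still 32 channels in the last one.
five_blocks = flattened_features(800, 600, out_channels=32, num_blocks=5)
print(five_blocks)                       # 14400
print(linear_params(five_blocks, 256))   # 3686656
```

Keep in mind that training needs more than the weights themselves: the gradients take the same amount of memory again, and Adam keeps two extra buffers per parameter, so the real footprint is several times the float32 weight size.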