Hello,
Here’s some simple code:
import torch
import torch.nn as nn
from torchvision import models


class Identity(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return x


class CNNLSTM(nn.Module):
    def __init__(self):
        super(CNNLSTM, self).__init__()
        self.BackBone = models.resnet18(pretrained=True)
        num_ftrs = self.BackBone.fc.in_features
        self.BackBone.fc = Identity()  # drop the classifier so the backbone returns 512-d features
        self.lstm = nn.LSTM(512, 512, batch_first=True)
        self.fc = nn.Linear(512, 1)

    def forward(self, video):
        # video: (batch, time, C, H, W) -> fold time into the batch dimension for the CNN
        batch_size, time_steps, C, H, W = video.size()
        c_in = video.view(batch_size * time_steps, C, H, W)
        c_out = self.BackBone(c_in)
        # unfold back to (batch, time, features) for the LSTM
        r_in = c_out.view(batch_size, time_steps, -1)
        r_out, _ = self.lstm(r_in)
        # binary prediction from the last time step
        output = torch.sigmoid(self.fc(r_out[:, -1, :]))
        return output.squeeze()
If I run the training loop without actually using the model, i.e. only fetching the batches and moving them to the GPU for a number of epochs, my GPU uses 1.1 GB out of 4.0 GB. That is, I only do this:
for epoch in range(num_epochs):
    print('Epoch {}/{}'.format(epoch + 1, num_epochs))
    for i, (inputs, labels) in enumerate(dataloader):
        inputs = inputs.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)
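The 1.1 GB figure is what the GPU reports overall; inside the loop I can also probe what PyTorch itself has allocated with something like this (a sketch; I'm assuming torch.cuda.memory_allocated and max_memory_allocated report the caching allocator's current and peak usage):

# Sketch: report PyTorch's current and peak allocations inside the loop.
print(torch.cuda.memory_allocated(device) / 1024**3, 'GB currently allocated')
print(torch.cuda.max_memory_allocated(device) / 1024**3, 'GB peak allocated')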
If I then call the model on the batch, however, I get a CUDA out of memory error. That is, I only add these lines to the inner loop:
optimizer.zero_grad()
outputs = model(inputs)
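For comparison, this is the variant I would try with the autograd graph disabled, to see how much of the extra memory is activations kept for backward (a sketch; I haven't measured this myself):

# Sketch: the same forward pass without storing activations for backward.
with torch.no_grad():
    outputs = model(inputs)
print(torch.cuda.max_memory_allocated(device) / 1024**3, 'GB peak with no_grad')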
Where do the 3 GB go? Do the gradients occupy memory? I thought that once you create the model and move it to the GPU with:
model = CNNLSTM()
model.to(device)
the graph and the gradients would already have allocated their memory?
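To sanity-check that assumption, this is roughly how I would estimate the memory taken by the parameters themselves (a sketch; it ignores optimizer state and gradients):

# Rough estimate of parameter memory: float32 = 4 bytes per element.
n_params = sum(p.numel() for p in model.parameters())
print(n_params, 'parameters,', n_params * 4 / 1024**3, 'GB in float32')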
Thank you