How to fix RuntimeError: CUDA out of memory

I got this error after the program ran 3 batches, sometimes 33 batches, even though I set batch_size=1. I have read some other topics but still don't know how to fix it. Can anyone help me? Thanks a lot.
My model is something like this:

def forward(self, input_id: torch.LongTensor,
                token_type_id: torch.LongTensor,
                attention_mask: torch.LongTensor) -> torch.FloatTensor:
        print("mem free: {} Mb".format(get_gpu_memory(self.cuda_id)))
        out = self.bert(input_ids=input_id,
                        token_type_ids=token_type_id,
                        attention_mask=attention_mask) # huggingface's transformers
        if self.use_bert_predictions:
            out = out[0]
        else:
            out = out[1]
        out, _ = self.encoder(out.unsqueeze(0))  # it's an LSTM: returns (output, (h_n, c_n))
        out = out[:, -1, :]  # keep the last timestep
        out = self.fc1(out)
        out = F.relu(out)
        out = self.fc2(out)
        return out
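The get_gpu_memory helper is not shown in the thread; a minimal sketch of what such a helper might look like, assuming it simply reports the free device memory via torch.cuda.mem_get_info (the name and argument are taken from the call above, the body is a guess):

import torch

def get_gpu_memory(device_id: int) -> float:
    """Return the free memory on the given CUDA device, in MB.
    Hypothetical helper -- the original post does not show its implementation."""
    free_bytes, _total_bytes = torch.cuda.mem_get_info(device_id)
    return free_bytes / 1024 ** 2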

my trainer is like this:

for step, trains in enumerate(train_iter):
    encodes, label = trains
    input_ids = torch.stack([item['input_ids']
                             for item in encodes[0]]).squeeze(1)
    token_type_ids = torch.stack(
        [item['token_type_ids'] for item in encodes[0]]).squeeze(1)
    attention_masks = torch.stack(
        [item['attention_mask'] for item in encodes[0]]).squeeze(1)
    # outputs = []
    logit = model(input_ids.cuda(config.Commom.cuda_id),
                  token_type_ids.cuda(config.Commom.cuda_id),
                  attention_masks.cuda(config.Commom.cuda_id))
    loss = loss_fct(logit, label.cuda(config.Commom.cuda_id))
    # outputs.append(logit.cpu().tolist())
    loss /= config.Train.grad_accumulate_step
    loss.backward()
    if (step + 1) % config.Train.grad_accumulate_step == 0:
        optimizer.step()
        optimizer.zero_grad()
    .....
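To see where the memory actually grows, it can help to log the allocator statistics once per step inside this loop; a small sketch using standard torch.cuda calls (the step and config.Commom.cuda_id names refer to the loop above):

# inside the training loop, e.g. right after loss.backward()
allocated = torch.cuda.memory_allocated(config.Commom.cuda_id) / 1024 ** 2
reserved = torch.cuda.memory_reserved(config.Commom.cuda_id) / 1024 ** 2
print(f"step {step}: allocated {allocated:.0f} MB, reserved {reserved:.0f} MB")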

and my error:

[screenshot of the RuntimeError: CUDA out of memory traceback, not reproduced here]
Before the error, the GPU still has 1527 MB free, but it can't allocate 28.00 MiB; I don't know why.

If this error seems to be raised “randomly”, this might point to e.g. an unusually large input batch.
If you are dealing with a variable sequence length, you might want to truncate the samples to a fixed size.
Also make sure you are not storing any tensors that are still attached to the computation graph during training.
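For example, truncation can be enforced at tokenization time and stored values detached first; a minimal sketch, assuming a Hugging Face tokenizer is used (the tokenizer object, text, and the 200-token limit are assumptions, not taken verbatim from the thread):

# 1) Truncate variable-length samples to a fixed size at tokenization time.
encoded = tokenizer(text,
                    truncation=True,        # cut samples longer than max_length
                    max_length=200,         # fixed upper bound on sequence length
                    padding="max_length",
                    return_tensors="pt")

# 2) Detach anything you keep around, so the computation graph is not retained.
outputs.append(logit.detach().cpu())        # instead of appending `logit` directly
running_loss += loss.item()                 # .item() returns a plain Python float, no graph kept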

Thanks ptrblck. On my machine it's always after 3 batches, but on another machine with the same hardware it's after 33 batches. Today I changed model.py and it now gets to 40 batches on my machine. I am not storing any tensors. Here is the updated forward:

def forward(self, input_id: torch.LongTensor,
                token_type_id: torch.LongTensor,
                attention_mask: torch.LongTensor,
                length: Tuple[int]) -> torch.FloatTensor:
        # transformers' BertModel here returns a (sequence_output, pooled_output) tuple
        if self.use_bert_predictions:
            out, _ = self.bert(input_ids=input_id,
                               token_type_ids=token_type_id,
                               attention_mask=attention_mask)
        else:
            _, out = self.bert(input_ids=input_id,
                               token_type_ids=token_type_id,
                               attention_mask=attention_mask)
        del _
        # split the flat BERT outputs back into variable-length groups
        out = out.split_with_sizes(length)
        seq_lengths = torch.LongTensor([len(x) for x in out])
        out = torch.nn.utils.rnn.pad_sequence(out, batch_first=True)
        out = torch.nn.utils.rnn.pack_padded_sequence(
            out, seq_lengths.cpu().numpy(), batch_first=True, enforce_sorted=False)
        out, _ = self.encoder(out)
        del _
        del seq_lengths
        # pad_packed_sequence returns the padded outputs and the original lengths
        out, lengths = torch.nn.utils.rnn.pad_packed_sequence(out, batch_first=True)
        assert len(out) == len(lengths)
        # keep the LSTM output at the last valid (non-padded) timestep of each sequence
        temp = []
        for i in range(len(out)):
            temp.append(out[i][lengths[i] - 1])
        out = torch.stack(temp)
        del temp
        gc.collect()
        out = self.fc1(out)
        out = F.relu(out)
        out = self.fc2(out)
        return out
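A side note on the del / gc.collect() calls above: they only drop Python references so PyTorch's caching allocator can reuse that memory within the process; they do not help if a tensor is still referenced elsewhere. To additionally return unused cached blocks to the driver (useful when inspecting free memory from outside the process, not for fixing an OOM caused by live tensors), the cache can be emptied explicitly:

import gc
import torch

gc.collect()              # drop unreachable Python objects that may still hold tensors
torch.cuda.empty_cache()  # release cached, currently-unused blocks back to the CUDA driver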

I fixed the sequence length at 200 tokens. On another task using the same GPU everything works fine. My GPU is an RTX 2080 Ti, and even when I use DDP with 2 or 3 RTX 2080 Tis, I get the same error.

Hi ptrblck, I have another question. The size of the data fed into my model may differ from batch to batch; could this cause the problem above? Thank you very much.

Yes, this might cause a memory spike and thus raise the out of memory issue, so try to make sure to keep the input shapes at a “reasonable” value.
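One way to keep the shapes bounded is to pad or truncate every batch to the same fixed length in the DataLoader's collate function; a minimal sketch, assuming each sample is a 1-D input_ids tensor paired with a label (the names and the 200-token cap are illustrative, not from the thread):

import torch
from torch.nn.utils.rnn import pad_sequence

MAX_LEN = 200  # fixed upper bound on sequence length

def collate_fixed_length(batch):
    # batch: list of (input_ids, label) pairs with variable-length 1-D input_ids
    ids = [seq[:MAX_LEN] for seq, _ in batch]      # truncate long samples
    labels = torch.tensor([label for _, label in batch])
    ids = pad_sequence(ids, batch_first=True)      # pad short samples to the batch max
    if ids.size(1) < MAX_LEN:                      # optionally pad every batch up to MAX_LEN
        pad = torch.zeros(ids.size(0), MAX_LEN - ids.size(1), dtype=ids.dtype)
        ids = torch.cat([ids, pad], dim=1)
    return ids, labels

# train_iter = DataLoader(dataset, batch_size=..., collate_fn=collate_fixed_length)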