How to fix RuntimeError: CUDA out of memory

I got this error after the program ran 3 batches, sometimes 33 batches, even though I set batch_size=1. I have read some other topics but still don't know how to fix it. Can anyone help me? Thanks a lot.
My model is something like this:

def forward(self, input_id: torch.LongTensor,
                token_type_id: torch.LongTensor,
                attention_mask: torch.LongTensor) -> torch.FloatTensor:
        print("mem free: {} Mb".format(get_gpu_memory(self.cuda_id)))
        out = self.bert(input_ids=input_id,
                        token_type_ids=token_type_id,
                        attention_mask=attention_mask) # huggingface's transformers
        if self.use_bert_predictions:
            out = out[0]
        else:
            out = out[1]
        out, _ = self.encoder(out.unsqueeze(0))  # it's an LSTM: returns (output, (h_n, c_n))
        out = out[:, -1, :]  # keep the last timestep
        out = self.fc1(out)
        out = F.relu(out)
        out = self.fc2(out)
        return out
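The get_gpu_memory helper is not shown in the thread; a minimal sketch of what such a helper might look like, assuming it simply reports the free device memory via torch.cuda.mem_get_info (the name and argument are taken from the call above, the body is a guess):

import torch

def get_gpu_memory(device_id: int) -> float:
    """Return the free memory on the given CUDA device, in MB.
    Hypothetical helper -- the original post does not show its implementation."""
    free_bytes, _total_bytes = torch.cuda.mem_get_info(device_id)
    return free_bytes / 1024 ** 2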

my trainer is like this:

for step, trains in enumerate(train_iter):
    encodes, label = trains
    input_ids = torch.stack([item['input_ids']
                             for item in encodes[0]]).squeeze(1)
    token_type_ids = torch.stack(
        [item['token_type_ids'] for item in encodes[0]]).squeeze(1)
    attention_masks = torch.stack(
        [item['attention_mask'] for item in encodes[0]]).squeeze(1)
    # outputs = []
    logit = model(input_ids.cuda(config.Commom.cuda_id),
                  token_type_ids.cuda(config.Commom.cuda_id),
                  attention_masks.cuda(config.Commom.cuda_id))
    loss = loss_fct(logit, label.cuda(config.Commom.cuda_id))
    # outputs.append(logit.cpu().tolist())
    loss /= config.Train.grad_accumulate_step
    loss.backward()
    if (step + 1) % config.Train.grad_accumulate_step == 0:
        optimizer.step()
        optimizer.zero_grad()
    .....
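To see where the memory actually grows, it can help to log the allocator statistics once per step inside this loop; a small sketch using standard torch.cuda calls (the step and config.Commom.cuda_id names refer to the loop above):

# inside the training loop, e.g. right after loss.backward()
allocated = torch.cuda.memory_allocated(config.Commom.cuda_id) / 1024 ** 2
reserved = torch.cuda.memory_reserved(config.Commom.cuda_id) / 1024 ** 2
print(f"step {step}: allocated {allocated:.0f} MB, reserved {reserved:.0f} MB")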

and my error:

[screenshot of the RuntimeError: CUDA out of memory traceback, not reproduced here]
Before the error, the GPU still has 1527 MB free, but it can't allocate 28.00 MiB; I don't know why.

If this error seems to be raised “randomly”, this might point to e.g. an unusually large input batch.
If you are dealing with a variable sequence length, you might want to truncate the samples to a fixed size.
Also make sure you are not storing any tensors that are still attached to the computation graph during training.
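For example, truncation can be enforced at tokenization time and stored values detached first; a minimal sketch, assuming a Hugging Face tokenizer is used (the tokenizer object, text, and the 200-token limit are assumptions, not taken verbatim from the thread):

# 1) Truncate variable-length samples to a fixed size at tokenization time.
encoded = tokenizer(text,
                    truncation=True,        # cut samples longer than max_length
                    max_length=200,         # fixed upper bound on sequence length
                    padding="max_length",
                    return_tensors="pt")

# 2) Detach anything you keep around, so the computation graph is not retained.
outputs.append(logit.detach().cpu())        # instead of appending `logit` directly
running_loss += loss.item()                 # .item() returns a plain Python float, no graph kept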

Thanks ptrblck. On my machine it's always after 3 batches, but on another machine with the same hardware it's after 33 batches. Today I changed model.py and it now gets to 40 batches on my machine. I am not storing any tensors. Here is the updated forward:

def forward(self, input_id: torch.LongTensor,
                token_type_id: torch.LongTensor,
                attention_mask: torch.LongTensor,
                length: Tuple[int]) -> torch.FloatTensor:
        # transformers' BertModel here returns a (sequence_output, pooled_output) tuple
        if self.use_bert_predictions:
            out, _ = self.bert(input_ids=input_id,
                               token_type_ids=token_type_id,
                               attention_mask=attention_mask)
        else:
            _, out = self.bert(input_ids=input_id,
                               token_type_ids=token_type_id,
                               attention_mask=attention_mask)
        del _
        # split the flat BERT outputs back into variable-length groups
        out = out.split_with_sizes(length)
        seq_lengths = torch.LongTensor([len(x) for x in out])
        out = torch.nn.utils.rnn.pad_sequence(out, batch_first=True)
        out = torch.nn.utils.rnn.pack_padded_sequence(
            out, seq_lengths.cpu().numpy(), batch_first=True, enforce_sorted=False)
        out, _ = self.encoder(out)
        del _
        del seq_lengths
        # pad_packed_sequence returns the padded outputs and the original lengths
        out, lengths = torch.nn.utils.rnn.pad_packed_sequence(out, batch_first=True)
        assert len(out) == len(lengths)
        # keep the LSTM output at the last valid (non-padded) timestep of each sequence
        temp = []
        for i in range(len(out)):
            temp.append(out[i][lengths[i] - 1])
        out = torch.stack(temp)
        del temp
        gc.collect()
        out = self.fc1(out)
        out = F.relu(out)
        out = self.fc2(out)
        return out
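A side note on the del / gc.collect() calls above: they only drop Python references so PyTorch's caching allocator can reuse that memory within the process; they do not help if a tensor is still referenced elsewhere. To additionally return unused cached blocks to the driver (useful when inspecting free memory from outside the process, not for fixing an OOM caused by live tensors), the cache can be emptied explicitly:

import gc
import torch

gc.collect()              # drop unreachable Python objects that may still hold tensors
torch.cuda.empty_cache()  # release cached, currently-unused blocks back to the CUDA driver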

I fixed the sequence length at 200 tokens. On another task using the same GPU everything works fine. My GPU is an RTX 2080 Ti, and even when I use DDP with 2 or 3 RTX 2080 Tis, I get the same error.

Hi ptrblck, I have another question. The size of the data fed into my model may differ from batch to batch; could this cause the problem above? Thank you very much.

Yes, this might cause a memory spike and thus raise the out of memory issue, so try to make sure to keep the input shapes at a “reasonable” value.
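One way to keep the shapes bounded is to pad or truncate every batch to the same fixed length in the DataLoader's collate function; a minimal sketch, assuming each sample is a 1-D input_ids tensor paired with a label (the names and the 200-token cap are illustrative, not from the thread):

import torch
from torch.nn.utils.rnn import pad_sequence

MAX_LEN = 200  # fixed upper bound on sequence length

def collate_fixed_length(batch):
    # batch: list of (input_ids, label) pairs with variable-length 1-D input_ids
    ids = [seq[:MAX_LEN] for seq, _ in batch]      # truncate long samples
    labels = torch.tensor([label for _, label in batch])
    ids = pad_sequence(ids, batch_first=True)      # pad short samples to the batch max
    if ids.size(1) < MAX_LEN:                      # optionally pad every batch up to MAX_LEN
        pad = torch.zeros(ids.size(0), MAX_LEN - ids.size(1), dtype=ids.dtype)
        ids = torch.cat([ids, pad], dim=1)
    return ids, labels

# train_iter = DataLoader(dataset, batch_size=..., collate_fn=collate_fixed_length)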