CUDA out of memory after 23 iterations

from tqdm import tqdm
import torch
import torch.nn.functional as F
import os

best_val_loss = 1e+6
model.train()
for epoch in range(num_epochs):
    loss_train = 0
    for data in tqdm(tokenized_dataset, desc=f"training epoch = {epoch}"):  
        # data['input_ids'].shape = (variable, 512)
        probs = torch.tensor(0.0).to(Device)
        labels = torch.tensor(data['Anorexia']).to(Device)
        for i in range(0, len(data['input_ids'])):
            tokens = {'input_ids': torch.LongTensor(data['input_ids'][i]).to(Device),
                      'attention_mask': torch.LongTensor(data['attention_mask'][i]).to(Device)}
            # tokens['input_ids'].shape = (512), same for token['attention_mask']
            model_output = model(**tokens)
            # model_output = SequenceClassifierOutput(loss=None, logits=tensor([[-0.8730,  0.6156]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
            prob = F.softmax(model_output['logits'], dim=-1)
            prob = prob[:, 1].mean()#.unsqueeze(0).unsqueeze(0)
            probs += prob

        loss = criterion(probs, labels.float())
        loss_train += loss.item()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f"training_loss = {loss_train/len(train_dataset)}")

This is the training loop. I am training a sequence classification model on a 32 GB GPU.

from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

The texts are very long, so I split each text into sub-texts and stacked the tokenized outputs together. I am unable to use a DataLoader because the resulting input_ids and attention_mask have different shapes across samples.

After 23 iterations it runs out of CUDA memory, even though I am only passing one tokenized chunk of shape (512) to the model at a time.

Can someone help me resolve this error? I don't think it is a genuine capacity issue; it looks like memory is being accumulated somewhere.

You are also accumulating the computation graphs via probs += prob, including all intermediate forward activations needed to compute the gradients in the backward pass.
Since you are running OOM in this loop, you might want to reduce the number of iterations or call backward() inside the loop, thus accumulating the gradients instead of the computation graphs.
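For reference, a rough sketch of that second option (untested; it reuses model, criterion, optimizer, Device, tokenized_dataset, and the data dict from your post, and it switches to a per-chunk loss so that each chunk's graph can be freed right after its backward() call):

for data in tqdm(tokenized_dataset, desc=f"training epoch = {epoch}"):
    labels = torch.tensor(data['Anorexia']).to(Device).float()
    n_chunks = len(data['input_ids'])
    optimizer.zero_grad()
    for i in range(n_chunks):
        tokens = {'input_ids': torch.LongTensor(data['input_ids'][i]).to(Device),
                  'attention_mask': torch.LongTensor(data['attention_mask'][i]).to(Device)}
        prob = F.softmax(model(**tokens)['logits'], dim=-1)[:, 1].mean()
        # per-chunk loss, scaled so the accumulated gradient averages over chunks
        loss = criterion(prob, labels) / n_chunks
        loss.backward()  # frees this chunk's graph immediately; grads accumulate in .grad
        loss_train += loss.item()
    optimizer.step()

Note that this changes the objective from one loss on the summed probability to an average of per-chunk losses; whether that is acceptable depends on how the chunk predictions should relate to the document-level label.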

Thank you @ptrblck for your response.
Is there any way to take the mean of prob over all the frames in data and backpropagate its loss without running out of CUDA memory?

Basically, I want the loss to be computed for the whole data sample (as it comes from the data loader) as a single entity and then backpropagated, instead of computing and backpropagating a loss for each individual frame.

I’m not sure I understand your request, but you won’t be able to pass the entire dataset through the model and backpropagate once, since you are already running out of memory. Calling backward() in smaller iterations should work, but will use more compute (since you need to run the backward pass more often).
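That said, if you do want the loss defined on the mean probability over all chunks as a single quantity, one memory-friendly workaround is a two-pass scheme: a no_grad pass to compute the mean probability, a backward through the criterion to get the gradient w.r.t. that mean, and then a second pass that re-runs each chunk with grad enabled and pushes its share of that gradient through the model. This is just a sketch under the same assumptions about model, criterion, Device, and data as above, and it is only approximate with dropout active, since the chunks are recomputed in the second pass:

n_chunks = len(data['input_ids'])
labels = torch.tensor(data['Anorexia']).to(Device).float()

def chunk_prob(i):
    tokens = {'input_ids': torch.LongTensor(data['input_ids'][i]).to(Device),
              'attention_mask': torch.LongTensor(data['attention_mask'][i]).to(Device)}
    return F.softmax(model(**tokens)['logits'], dim=-1)[:, 1].mean()

# pass 1: mean probability over all chunks, no graphs kept
with torch.no_grad():
    mean_prob = torch.stack([chunk_prob(i) for i in range(n_chunks)]).mean()

# gradient of the loss w.r.t. the mean probability
mean_prob_leaf = mean_prob.requires_grad_()
loss = criterion(mean_prob_leaf, labels)
loss.backward()
grad_mean = mean_prob_leaf.grad

# pass 2: re-run each chunk with grad and backpropagate its share of the gradient;
# only one chunk's graph is in memory at a time
optimizer.zero_grad()
for i in range(n_chunks):
    chunk_prob(i).backward(grad_mean / n_chunks)  # dL/dθ += dL/dP · (1/N) · dp_i/dθ
optimizer.step()

This keeps the single "mean over all frames" loss you asked about, at the cost of two forward passes per chunk.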

loss = criterion(probs, labels.float())
loss_train += loss.item()
optimizer.zero_grad()
loss.backward()

Push the above lines into the scope of the inner for loop; that way the gradient calculation is done for each chunk as it passes through the model. This should avoid the graph accumulation and the corresponding OOM.
