Error: CUDA out of memory when training on 2.5 million images; works fine on 150K images

Hi, I have seen a couple of posts about this error, but no one has described their solution in detail. I am getting a strange ‘out of memory’ error when I run my imaging pipeline on 2.5 million images; on 150,000 images it works just fine.

This is the error message:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-22-f844cad716f4> in <module>
     19     return log.plot_epochs(log=True)
     20 
---> 21 training_step(n_epochs, data, encoder, decoder, optimizer, criterion)

<ipython-input-22-f844cad716f4> in training_step(n_epochs, data, encoder, decoder, optimizer, criterion)
      4         N = len(trn_dl)
      5         for i, data in enumerate(trn_dl):
----> 6             trn_loss = train_batch(data, encoder, decoder, optimizer, criterion)
      7             #trn_loss = train_batch(data, encoder, decoder, optimizer, criterion, batch_size)
      8             pos = epoch + (1+i)/N

<ipython-input-16-b949d8041438> in train_batch(data, encoder, decoder, optimizer, criterion)
     13     encoder.zero_grad()
     14     loss.backward()
---> 15     optimizer.step()
     16     return loss

/opt/conda/lib/python3.8/site-packages/torch/optim/optimizer.py in wrapper(*args, **kwargs)
     86                 profile_name = "Optimizer.step#{}.step".format(obj.__class__.__name__)
     87                 with torch.autograd.profiler.record_function(profile_name):
---> 88                     return func(*args, **kwargs)
     89             return wrapper
     90 

/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
     26         def decorate_context(*args, **kwargs):
     27             with self.__class__():
---> 28                 return func(*args, **kwargs)
     29         return cast(F, decorate_context)
     30 

/opt/conda/lib/python3.8/site-packages/torch/optim/adamw.py in step(self, closure)
     90                     state['step'] = 0
     91                     # Exponential moving average of gradient values
---> 92                     state['exp_avg'] = torch.zeros_like(p, memory_format=torch.preserve_format)
     93                     # Exponential moving average of squared gradient values
     94                     state['exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format)

RuntimeError: CUDA out of memory. Tried to allocate 2.19 GiB (GPU 0; 15.78 GiB total capacity; 14.21 GiB already allocated; 144.75 MiB free; 14.29 GiB reserved in total by PyTorch)

The batch size for the dataloader is 32, but I’ve tried anywhere between 5 and 32 and it makes no difference in the CUDA memory usage. I’ve found that a lot of memory is used up after my ResNet.

The basic nn stack is:

  1. encoder: a ResNet with the final fc layer removed, so only the features from transfer learning are exposed
  2. decoder: an LSTM seq-to-seq model with one linear layer

The goal is to predict image captions for a set of up to 2.5 million images.
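Roughly, the model looks like this (a simplified sketch with a resnet18 backbone and placeholder dimensions, not my exact code):

import torch
import torch.nn as nn
import torchvision
from torch.nn.utils.rnn import pack_padded_sequence

class Encoder(nn.Module):
    """ResNet backbone with the final fc layer removed (transfer learning)."""
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet18(pretrained=True)
        # keep everything up to and including the avgpool, drop the fc layer
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])

    def forward(self, images):
        return self.backbone(images).flatten(1)     # (batch, 512)

class Decoder(nn.Module):
    """LSTM seq-to-seq decoder with a single linear output layer."""
    def __init__(self, vocab_size, feat_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, feat_dim)
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, features, captions, lengths):
        # image features act as the first timestep, followed by the caption embeddings
        embeddings = torch.cat([features.unsqueeze(1), self.embed(captions)], dim=1)
        packed = pack_padded_sequence(embeddings, lengths.cpu(), batch_first=True)
        outputs, _ = self.lstm(packed)
        return self.fc(outputs.data)                # (sum of lengths, vocab_size)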

The training code is below:

def training_step(n_epochs, data, encoder, decoder, optimizer, criterion):
    for epoch in range(n_epochs):
        if epoch == 5: optimizer = torch.optim.AdamW(params, lr=1e-4)
        N = len(trn_dl)
        for i, data in enumerate(trn_dl):
            trn_loss = train_batch(data, encoder, decoder, optimizer, criterion)
            #trn_loss = train_batch(data, encoder, decoder, optimizer, criterion, batch_size)
            pos = epoch + (1+i)/N
            log.record(pos=pos, trn_loss=trn_loss, end='\r')

        N = len(val_dl)
        for i, data in enumerate(val_dl):
            val_loss = validate_batch(data, encoder, decoder, criterion)
            #val_loss = validate_batch(data, encoder, decoder, criterion, batch_size)
            pos = epoch + (1+i)/N
            log.record(pos=pos, val_loss=val_loss, end='\r')

        log.report_avgs(epoch+1)
    return log.plot_epochs(log=True)

training_step(n_epochs, data, encoder, decoder, optimizer, criterion)

The memory usage is as follows:
  1. 0 after basic data loading of the images using the custom DataLoader
  2. jumps to 7,295,012,864 bytes (~6.8 GiB) after sending the encoder and decoder .to(device)

I tried the following to empty the cache, and it did nothing:

print(torch.cuda.memory_allocated())
print(torch.cuda.memory_cached())
torch.cuda.empty_cache()
print(torch.cuda.memory_cached())
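Note that memory_cached() is just the older name; on recent PyTorch versions the same check would use torch.cuda.memory_reserved():

print(torch.cuda.memory_allocated())
print(torch.cuda.memory_reserved())
torch.cuda.empty_cache()
print(torch.cuda.memory_reserved())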

If we figure out the problem, I will post a detailed answer for future forum users. I’d appreciate any suggestions on how to fix this.

I’m not sure how log actually stores trn_loss and val_loss, but you might want to pass trn_loss.detach().item() and val_loss.detach().item() here instead, to avoid keeping around tensors that aren’t needed anymore.
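In other words, something along these lines (assuming log.record only needs the numeric value):

log.record(pos=pos, trn_loss=trn_loss.detach().item(), end='\r')
log.record(pos=pos, val_loss=val_loss.detach().item(), end='\r')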

I tried this elsewhere in the code and it told me “only one element tensors can be converted to Python scalars”. I think each of the tensors has 3 elements. Any workaround?

Does changing this to .detach().clone() work?

It doesn’t. Hmm… here’s the piece of code where I was trying that:

def train_batch(data, encoder, decoder, optimizer, criterion):
    encoder.train()
    decoder.train()
    images, captions, lengths = data
    images = images.to(device)
    captions = captions.to(device)
    #images = image.to('cpu')
    #captions = captions.to('cpu')
    targets = pack_padded_sequence(captions, lengths.cpu(), batch_first=True)[0]
    #targets = pack_padded_sequence(captions, lengths.gpu(), batch_first=True)[0]
    features = encoder(images)
    outputs = decoder(features, captions, lengths)
    loss = criterion(outputs, targets)
    targets.detach().clone() #DEBUG
    features.detach().clone() #DEBUG
    decoder.zero_grad()
    encoder.zero_grad()
    loss.backward()
    optimizer.step()
    return loss

Right, I was referring to these lines:

log.record(pos=pos, trn_loss=trn_loss, end='\r')
log.record(pos=pos, val_loss=val_loss, end='\r')

and changing them to something like

log.record(pos=pos, trn_loss=trn_loss.detach().clone(), end='\r')
log.record(pos=pos, val_loss=val_loss.detach().clone(), end='\r')

It didn’t change things. Perhaps there’s a way to batch the training such that I iterate through maybe 5000 images at a time, and then save that to disk each time, clearing up some cache on the GPU?

Is the OOM happening at the very first batch of the very first epoch or is it happening after a while?

There shouldn’t be any issues with GPU caching provided there isn’t some sort of unintentional “leak” somewhere due to tensors being kept around when they aren’t used anymore. From the dataloader usage here only a few batches at most should be on the GPU.

You might want to remove parts of the training loop to incrementally check if any functions are causing the memory usage to grow with every iteration or if the memory usage is a lot higher than expected for a specific function.
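You could e.g. print torch.cuda.memory_allocated() after each step of train_batch to see where the memory goes; a debugging sketch only (the mem helper here is made up for illustration and can be removed afterwards):

def train_batch(data, encoder, decoder, optimizer, criterion):
    def mem(tag):
        # hypothetical debug helper: prints currently allocated GPU memory in MiB
        print(f"{tag}: {torch.cuda.memory_allocated() / 1024**2:.0f} MiB")

    encoder.train()
    decoder.train()
    images, captions, lengths = data
    images = images.to(device)
    captions = captions.to(device)
    mem("after moving batch to GPU")
    targets = pack_padded_sequence(captions, lengths.cpu(), batch_first=True)[0]
    features = encoder(images)
    mem("after encoder forward")
    outputs = decoder(features, captions, lengths)
    loss = criterion(outputs, targets)
    mem("after decoder forward + loss")
    decoder.zero_grad()
    encoder.zero_grad()
    loss.backward()
    mem("after backward")
    optimizer.step()
    mem("after optimizer step")
    return loss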


Great idea! Newbie question here: do you mean something like:

def train_batch(data, encoder, decoder, optimizer, criterion):
    encoder.train()
    decoder.train()
    images, captions, lengths = data
    images = images.to(device)
    captions = captions.to(device)
    #images = image.to('cpu')
    #captions = captions.to('cpu')
    targets = pack_padded_sequence(captions, lengths.cpu(), batch_first=True)[0]
    features = encoder(images)
    outputs = decoder(features, captions, lengths)
    loss = criterion(outputs, targets)
    #targets.detach().clone() #DEBUG
    #features.detach().clone() #DEBUG
    decoder.zero_grad()
    encoder.zero_grad()
    loss.backward()
    optimizer.step()
    return loss

and remove, for example, decoder.zero_grad() to see if that is where the memory leak exists? You’re 100% right: it shows the out-of-memory error without even starting the training loop.

Sure, although in this particular case I wouldn’t suspect any of the zero_grad operations to be the culprit as they should be inplace operations that don’t allocate any memory.

Thank you so much. I saw that the encoder and decoder were inside the training loop; that wasn’t necessary. Now it appears to be training correctly on the images.

def training_step(n_epochs, data, encoder, decoder, optimizer, criterion):
    for epoch in range(n_epochs):
        #if epoch == 5: optimizer = torch.optim.AdamW(params, lr=1e-4)
        N = len(trn_dl)
        encoder.train()
        decoder.train()
        for i, data in enumerate(trn_dl):
            images, captions, lengths = data
            #images = images.to(device)
            #captions = captions.to(device)
            targets = pack_padded_sequence(captions, lengths.cpu(), batch_first=True)[0]
            features = encoder(images)
            outputs = decoder(features, captions, lengths)
            trn_loss = criterion(outputs, targets)
            decoder.zero_grad()
            encoder.zero_grad()
            #trn_loss.backward()
            optimizer.step()
            pos = epoch + (1+i)/N
            log.record(pos=pos, trn_loss=trn_loss, end='\r')
     

        N = len(val_dl)
        for i, data in enumerate(val_dl):
            val_loss = validate_batch(data, encoder, decoder, criterion)
            #val_loss.detach().clone()
            #val_loss = validate_batch(data, encoder, decoder, criterion, batch_size)
            pos = epoch + (1+i)/N
            log.record(pos=pos, val_loss=val_loss, end='\r')
        #    torch.cuda.empty_cache()

        #log.report_avgs(epoch+1)
    #return log.plot_epochs(log=True)

training_step(n_epochs, data, encoder, decoder, optimizer, criterion)