RuntimeError: CUDA out of memory. Tried to allocate 1.12 MiB (GPU 0; 11.91 GiB total capacity; 5.52 GiB already allocated; 2.06 MiB free; 184.00 KiB cached)

DeepLearner17 · April 9, 2019, 2:39pm

Hello,
I’m running my code on a Tesla P100 (16 GB). I keep hitting a CUDA out-of-memory error; it appears after 17 epochs of training:

line 2234, in forward
    x = nn.AvgPool1d(90, stride=None)(x)
  File "/local/anaconda3/envs/torch_edward/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/local/anaconda3/envs/torch_edward/lib/python3.7/site-packages/torch/nn/modules/pooling.py", line 499, in forward
    self.count_include_pad)
RuntimeError: CUDA out of memory. Tried to allocate 1.12 MiB (GPU 0; 11.91 GiB total capacity; 5.52 GiB already allocated; 2.06 MiB free; 184.00 KiB cached)

The error is raised in the forward function, at the line "x = nn.AvgPool1d(90, stride=None)(x)":

def forward(self, x):
    x = self.layer1(x)
    x = self.layer2(x)
    # ...
    x = self.layer10(x)

    x = nn.AvgPool1d(90, stride=None)(x)  # here I get the error

    x = x.squeeze(2)

    x = self.fc1(x)
    return x

Any clue?

Thank you

ezyang · April 9, 2019, 3:08pm

Does reducing your batch size help? Can you post your full training script? Have you tried the diagnoses at https://pytorch.org/docs/stable/notes/faq.html#my-model-reports-cuda-runtime-error-2-out-of-memory ?

DeepLearner17 · April 10, 2019, 1:54pm

Hi @ezyang,

I solved it by manually deleting my variables after each batch iteration.
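
A minimal sketch of that approach, assuming a toy linear model and random tensors in place of the real network and data loader (every name here is illustrative, not taken from the original training script):

```python
import torch
from torch import nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in model, optimizer and loss (assumptions, not the original code)
net = nn.Linear(8, 1).to(device)
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
criterion = nn.MSELoss()

losses = []
for _ in range(3):  # stand-in for iterating over a DataLoader
    train_x = torch.randn(4, 8, device=device)
    train_y = torch.randn(4, 1, device=device)

    optimizer.zero_grad()
    y = net(train_x)
    loss = criterion(y, train_y)
    loss.backward()
    optimizer.step()

    # Keep only a Python float, which holds no reference to the graph
    losses.append(loss.item())

    # Drop the references so the tensors can be freed before the next batch
    del train_x, train_y, y, loss
```

Deleting the names only removes references; the memory is actually released once nothing else (a list, a logger, a closure) still points at those tensors.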

Another common solution is:

import sys
import torch

try:
    y = net(train_x)
except RuntimeError as e:
    if 'out of memory' in str(e):
        print('| WARNING: ran out of memory, retrying batch', file=sys.stdout)
        sys.stdout.flush()
        for p in net.parameters():
            if p.grad is not None:
                del p.grad  # free some memory
        torch.cuda.empty_cache()
        y = net(train_x)  # retry the same batch once
    else:
        raise

Or try the “Preview (Nightly)” PyTorch build.

bibinmjose · January 23, 2020, 6:17pm

Were you able to find out why this is happening? It is very strange, since I call optim.zero_grad() inside the training loop. Could it be that CUDA is not purging the data from the dataloader on every iteration, so that it accumulates?

ptrblck · January 24, 2020, 6:19am

Could you check whether wrapping the training step in a function helps? Python uses function scoping, as described here.
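
A minimal sketch of that suggestion, assuming a toy model and random tensors (the name train_step and all shapes are illustrative, not from this thread). The intermediate tensors are locals of the function, so their references disappear when it returns:

```python
import torch
from torch import nn


def train_step(model, optimizer, criterion, x, target):
    """Run one optimization step and return the loss as a Python float."""
    optimizer.zero_grad()
    output = model(x)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
    # output and loss go out of scope here; only a float leaves the function
    return loss.item()


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(8, 1).to(device)  # stand-in model (assumption)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.L1Loss()

batch_losses = []
for _ in range(3):  # stand-in for a DataLoader
    x = torch.randn(4, 8, device=device)
    target = torch.randn(4, 1, device=device)
    batch_losses.append(train_step(model, optimizer, criterion, x, target))
```

This only helps if the loop body itself does not keep tensors alive, for example by appending the loss tensor rather than its float value to a list.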

bibinmjose · January 24, 2020, 9:01am

Hi @ptrblck, thanks for the reply. I already have it wrapped in a train function that I call from __main__. I presume this is what you meant? Still the same error. I tried everything from changing batch_size to reducing the number of parameters, but it throws an error before an epoch completes, usually towards the end of the epoch, as shown below. I also tried calling torch.cuda.empty_cache() inside the training loop, with no success either.

Error Message : RuntimeError: CUDA out of memory. Tried to allocate 4.37 GiB (GPU 0; 11.17 GiB total capacity; 4.78 GiB already allocated; 1.58 GiB free; 4.49 GiB cached)
```
===========
current memory allocated: 4693.4131
Max memory allocated: 9319.9683
Cached memory: 9490.0000
===========
```
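
The memory_log helper that printed the numbers above is not shown in the thread. A hypothetical helper producing similar output could look like the sketch below; memory_snapshot is an invented name, and it assumes a PyTorch version that provides torch.cuda.memory_reserved (older releases called it memory_cached):

```python
import torch


def memory_snapshot(device=None):
    """Return allocated, peak and reserved CUDA memory in MiB.

    Returns zeros when CUDA is not available, so the helper is safe on CPU.
    """
    if not torch.cuda.is_available():
        return {"allocated": 0.0, "max_allocated": 0.0, "reserved": 0.0}
    mib = 1024 ** 2
    return {
        # memory currently occupied by live tensors
        "allocated": torch.cuda.memory_allocated(device) / mib,
        # peak of the value above since the start of the program
        "max_allocated": torch.cuda.max_memory_allocated(device) / mib,
        # memory held by the caching allocator (live tensors plus cache)
        "reserved": torch.cuda.memory_reserved(device) / mib,
    }


snapshot = memory_snapshot()
print("current memory allocated: {:.4f}".format(snapshot["allocated"]))
print("Max memory allocated: {:.4f}".format(snapshot["max_allocated"]))
print("Cached memory: {:.4f}".format(snapshot["reserved"]))
```

Calling such a helper every few batches, rather than only when an exception is raised, makes a gradual increase visible long before the allocation fails.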

def train(args):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    train_dataloader, test_dataloader, embedding_tuple = get_loader(args)
    model = SmallEmbedNN(embedding_tuple, len(args.cont_cols), 1,
                         emb_dropout=0.1, lin_layer_dropouts=0.1).to(device)
    
    criterion_l1 = L1Loss()
    
    # sparse optimizer for embeddings and normal optimizer for rest
    embed_sparse_vec = 'emb_layer.weight'
    opt = torch.optim.Adam([weights for name, weights in model.state_dict().items() \
                            if name!=embed_sparse_vec], lr=args.lr)
    optSparse = torch.optim.SparseAdam([model.state_dict()[embed_sparse_vec]], 
                                       lr=args.lr)
    train_loss_list=[]
    test_loss_list=[]
    
    try:
        for epoch in range(args.epochs):
            model.train()
            for _dict in tqdm(train_dataloader):
                # pass vars to cuda device
                _dict = {key:var.to(device) for key, var in _dict.items()}
                
                # zeroing parameter gradients
                opt.zero_grad()
                optSparse.zero_grad()

                # Forward Pass
                preds = model(_dict['cont'], _dict['cat'])
                loss_train_l1 = criterion_l1(preds, _dict['target'], 
                yr_weights=_dict['wts']) #L1 loss

                # Backward Pass and Optimization
                loss_train_l1.backward()
                opt.step()
                optSparse.step()

                # batch train loss
                train_loss = loss_train_l1.item()
                logger.info(f'Epoch:{epoch} train[batch] loss(L1):{train_loss:.4f}')
                # release unused cached memory held by the allocator
                torch.cuda.empty_cache()

            # append train loss
            train_loss_list.append(train_loss)

            # evaluate test-set and append test loss
            test_loss = eval_test(model, test_dataloader, criterion_l1, device)
            test_loss_list.append(test_loss)
            if epoch%1==0:
                logger.info(f'Epoch:{epoch}\ttrain_loss:{train_loss_list[-1]:.6f} \
                        \ttest_loss:{test_loss_list[-1]:.6f}')

            # early stopping if test loss has not improved in the last 20 epochs
            if min(test_loss_list) < min(test_loss_list[-20:]):
                break
    except Exception as e:
        logger.exception(e)
        memory_log(logger)
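
One detail in the snippet above may be worth checking, although the thread does not establish whether it relates to the memory growth: model.state_dict() returns detached tensors by default, so optimizers built from those tensors never receive gradients. A sketch of the same dense/sparse split using named_parameters() instead, with a hypothetical TinyEmbedNet standing in for SmallEmbedNN:

```python
import torch
from torch import nn


class TinyEmbedNet(nn.Module):
    """Hypothetical stand-in for SmallEmbedNN (not the original model)."""

    def __init__(self):
        super().__init__()
        self.emb_layer = nn.Embedding(100, 4, sparse=True)
        self.fc = nn.Linear(4, 1)

    def forward(self, cat):
        return self.fc(self.emb_layer(cat))


model = TinyEmbedNet()
embed_sparse_vec = 'emb_layer.weight'

# state_dict() hands back detached tensors, not the trainable Parameters
detached = model.state_dict()[embed_sparse_vec]

# named_parameters() yields the Parameters that autograd fills with grads
dense_params = [p for n, p in model.named_parameters() if n != embed_sparse_vec]
sparse_params = [p for n, p in model.named_parameters() if n == embed_sparse_vec]
opt = torch.optim.Adam(dense_params, lr=1e-3)
opt_sparse = torch.optim.SparseAdam(sparse_params, lr=1e-3)

before = model.fc.weight.detach().clone()
cat = torch.randint(0, 100, (8,))  # random category indices

opt.zero_grad()
opt_sparse.zero_grad()
loss = model(cat).abs().mean()
loss.backward()
opt.step()
opt_sparse.step()
```

With this split, the dense weights change after opt.step() and the embedding receives a sparse gradient for opt_sparse.step().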

bibinmjose · January 24, 2020, 9:04am

@ptrblck Is there any particular method/function to purge the training data from CUDA memory after each batch? Or does it get freed by some sort of garbage collection? Or could it have anything to do with the sparse optimizer?

bibinmjose · January 24, 2020, 3:34pm

@ptrblck I found what is causing the error: it is the sparse optimizer. When I do not use a sparse embedding layer or optimizer, there seems to be no problem at all. Do you know of any such issues relating to the sparse optimizer?

@smth Hi Soumith, have you seen the sparse optimizer causing CUDA errors? I am guessing it is caused by params accumulating on the GPU inside the training loop. I would highly appreciate any pointer in the right direction. Thanks a lot!

ptrblck · January 26, 2020, 2:27am

I’m not aware of any issues that might create unnecessary OOM errors.
If I’m not mistaken, SparseAdam will lazily compute the updates as:

In this variant, only moments that show up in the gradient get updated, and
only those portions of the gradient get applied to the parameters.
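
A small sketch of that lazy behavior, assuming a toy 10-row embedding with sparse=True (sizes, indices and learning rate are arbitrary). After one step, only the rows that were looked up in the batch should differ from their initial values:

```python
import torch
from torch import nn

torch.manual_seed(0)

emb = nn.Embedding(10, 3, sparse=True)  # sparse=True yields sparse gradients
opt_sparse = torch.optim.SparseAdam([emb.weight], lr=0.1)

before = emb.weight.detach().clone()
idx = torch.tensor([1, 3])  # only rows 1 and 3 appear in this batch

opt_sparse.zero_grad()
loss = emb(idx).sum()
loss.backward()
opt_sparse.step()

# Rows whose values changed during the step
changed_rows = (
    (emb.weight.detach() != before).any(dim=1).nonzero().flatten().tolist()
)
print(changed_rows)
```

Rows that never appear in a batch keep both their weights and their moment estimates untouched.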

Could you just run out of memory for a specific input, which uses more entries in your sparse input?

bibinmjose · January 26, 2020, 11:11am

Hi @ptrblck, thanks for the reply!
I am not sure that I am running out of memory on a specific input, since I tried various train batch sizes that update different numbers of parameters. Besides, the error is always thrown towards the end of the 1st epoch, never at the beginning. I also observed that memory keeps increasing (not monotonically, but as a trend) when using the sparse optimizer. Anyway, I will keep investigating.

ptrblck · January 26, 2020, 6:45pm

Could you post a code snippet showing this behavior?
How did you define eval_test?
Could you check that all tensors, which you are appending to a list, are properly detached from the computation graph?
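
To illustrate the last question, a short sketch with a toy model (all names are illustrative) showing the difference between storing the loss tensor itself and storing a detached value:

```python
import torch
from torch import nn

model = nn.Linear(4, 1)  # stand-in model (assumption)
x = torch.randn(2, 4)
loss = model(x).pow(2).mean()

holds_graph = loss         # still references the whole autograd graph
as_float = loss.item()     # plain Python float, no graph attached
as_tensor = loss.detach()  # tensor with the same value, cut off from the graph

history = []
history.append(as_float)   # safe to accumulate across iterations
```

Appending holds_graph to a list on every iteration would keep each iteration's graph and its activations alive, which shows up as steadily growing memory.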

bibinmjose · January 28, 2020, 12:02am

@ptrblck

def eval_test(model, test_dataloader, criterion, device):
    """Evaluation model on test set"""
    loss = []
    model.eval()
    with torch.no_grad():
        for _dict in test_dataloader:
            _dict = {key:var.to(device) for key, var in _dict.items()}
            preds = model(_dict['cont'], _dict['cat'])
            loss.append(criterion(preds, _dict['target'], yr_weights=_dict['wts']).item())
    return np.mean(loss)

I believe .item() detaches it. Also, I am using an AWS cloud instance; could this be due to a version incompatibility? I am so puzzled by this bug.

ptrblck · January 28, 2020, 3:52am

Are you only seeing the increase in memory with SparseAdam or also with other optimizers?
I can’t see any obvious errors in your code.

bibinmjose · February 4, 2020, 10:16am

@ptrblck I am not aware of any sparse optimizer other than SparseAdam. I tried RMSprop, but the same error persists.

RylanSchaeffer · September 30, 2021, 7:36pm

@bibinmjose Did you find the cause of the error?