RuntimeError: CUDA out of memory. Tried to allocate 1.12 MiB (GPU 0; 11.91 GiB total capacity; 5.52 GiB already allocated; 2.06 MiB free; 184.00 KiB cached)

I’m running my code on a Tesla P100 (16 GB). I get stuck with a CUDA out-of-memory error after 17 epochs of training:

line 2234, in forward
    x = nn.AvgPool1d(90, stride=None)(x)
  File "/local/anaconda3/envs/torch_edward/lib/python3.7/site-packages/torch/nn/modules/", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/local/anaconda3/envs/torch_edward/lib/python3.7/site-packages/torch/nn/modules/", line 499, in forward
RuntimeError: CUDA out of memory. Tried to allocate 1.12 MiB (GPU 0; 11.91 GiB total capacity; 5.52 GiB already allocated; 2.06 MiB free; 184.00 KiB cached)

The error comes from this line of the forward function: x = nn.AvgPool1d(90, stride=None)(x)

def forward(self, x):
    x = nn.AvgPool1d(90, stride=None)(x)  # the error is raised here
    return x

Any clue?

Thank you

Does reducing your batch size help? Can you post your full training script? Have you tried the diagnoses at ?

Hi @ezyang,

I solved it by manually deleting my variables after each batch iteration.
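A minimal sketch of what deleting variables after each batch can look like. The toy model, the random batches, and all names below are assumptions for illustration and are not taken from the original training script:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(16, 1).to(device)        # hypothetical toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.MSELoss()

losses = []
for _ in range(5):
    inputs = torch.randn(32, 16, device=device)  # hypothetical random batch
    targets = torch.randn(32, 1, device=device)

    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()

    # Store a Python float, not the tensor that still holds its graph.
    losses.append(loss.item())
    # Drop the references so this memory can be reused before the next
    # batch is built, instead of staying alive until the names are rebound.
    del inputs, targets, outputs, loss
```

Without the `del`, the tensors from the previous iteration stay referenced until the same names are reassigned, so the peak can briefly hold two batches' worth of activations.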

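Another common solution is to catch the out-of-memory error, free the gradients, and retry the batch:

import sys

try:
    y = net.forward(train_x)
except RuntimeError as e:
    if 'out of memory' in str(e):
        print('| WARNING: ran out of memory, retrying batch', file=sys.stdout)
        for p in net.parameters():
            if p.grad is not None:
                del p.grad  # free some memory
        y = net.forward(train_x)  # retry once with the gradients released
    else:
        raise e  # re-raise anything that is not an OOM error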

Or try the “Preview (Nightly)” PyTorch build.

Were you able to find out why this is happening? It is very strange, since I call optim.zero_grad() inside the training loop. Could it be that CUDA is not purging the data from the dataloader on every iteration, so that it accumulates?

Could you check whether wrapping the training step in a function helps? Python uses function scoping, as described here.
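The scoping point can be illustrated without a GPU. In the sketch below, `FakeBatch` is a hypothetical stand-in for a large tensor; only its lifetime matters. An object bound to a local name is released when the function returns, while an object bound at module level stays alive after the loop ends:

```python
import gc
import weakref


class FakeBatch:
    """Hypothetical stand-in for a large GPU tensor."""


def step_in_function():
    batch = FakeBatch()          # local name, released when the function returns
    return weakref.ref(batch)    # a weak reference does not keep it alive


# Module-level loop: the name `leaked` stays bound after the loop finishes.
for _ in range(3):
    leaked = FakeBatch()
leaked_ref = weakref.ref(leaked)

scoped_ref = step_in_function()
gc.collect()

print(scoped_ref() is None)   # expected True under CPython's reference counting
print(leaked_ref() is None)   # expected False: `leaked` still references it
```

For real tensors the same rule applies: anything still reachable through a module-level name cannot be returned to PyTorch's caching allocator.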

Hi @ptrblck, thanks for the reply. I already have it wrapped in a train function that I call from __main__. I presume this is what you meant? Still the same error. I tried everything, from changing batch_size to reducing the number of parameters, but it throws an error before an epoch completes, usually towards the end of the epoch, as shown below. I also tried calling torch.cuda.empty_cache() inside the training loop, with no success either.

  • Error Message : RuntimeError: CUDA out of memory. Tried to allocate 4.37 GiB (GPU 0; 11.17 GiB total capacity; 4.78 GiB already allocated; 1.58 GiB free; 4.49 GiB cached)
    current memory allocated: 4693.4131
    Max memory allocated: 9319.9683
    Cached memory: 9490.0000
def train(args):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    train_dataloader, test_dataloader, embedding_tuple = get_loader(args)
    model = SmallEmbedNN(embedding_tuple, len(args.cont_cols), 1,
                         emb_dropout=0.1, lin_layer_dropouts=0.1).to(device)
    criterion_l1 = L1Loss()
    # sparse optimizer for embeddings and normal optimizer for rest
    embed_sparse_vec = 'emb_layer.weight'
    opt = torch.optim.Adam([weights for name, weights in model.state_dict().items() \
                            if name!=embed_sparse_vec],
    optSparse = torch.optim.SparseAdam([model.state_dict()[embed_sparse_vec]], 
        for epoch in range(args.epochs):
            for _dict in tqdm(train_dataloader):
                # pass vars to cuda device
                _dict = {key: var.to(device) for key, var in _dict.items()}
                # zeroing parameter gradients
                opt.zero_grad()
                optSparse.zero_grad()

                # Forward Pass
                preds = model(_dict['cont'], _dict['cat'])
                loss_train_l1 = criterion_l1(preds, _dict['target'], 
                yr_weights=_dict['wts']) #L1 loss

                # Backward Pass and Optimization
                loss_train_l1.backward()
                opt.step()
                optSparse.step()

                # batch train loss
                train_loss = loss_train_l1.item()
      'Epoch:{epoch} train[batch] loss(L1):{train_loss:.4f}')
                # defrag cached memory
                torch.cuda.empty_cache()

            # append train loss

            # evaluate test-set and append test loss
            test_loss = eval_test(model, test_dataloader, criterion_l1, device)
            if epoch%1==0:
      'Epoch:{epoch}\ttrain_loss:{train_loss_list[-1]:.6f} \

            # earlystopping if test loss is not improving in last 10 epochs
            if min(test_loss_list) < min(test_loss_list[-20:]):
    except Exception as e:

@ptrblck Is there any particular method/function to purge the training data out of CUDA after each batch pass? Or does it get emptied by some sort of garbage collection? Or could it be anything to do with sparse optimizer?

@ptrblck I found what is causing the error: it is the sparse optimizer. When I do not use a sparse embedding layer or optimizer, there seems to be no problem at all. Do you know of any such issues related to the sparse optimizer?

@smth Hi Soumith, have you seen the sparse optimizer cause CUDA errors? My guess is that it is caused by parameters accumulating on the GPU inside the training loop. I would highly appreciate any help in the right direction. Thanks a lot!

I’m not aware of any issues that might cause unnecessary OOM errors.
If I’m not mistaken, SparseAdam will lazily compute the updates as:

In this variant, only moments that show up in the gradient get updated, and
only those portions of the gradient get applied to the parameters.

Could you just run out of memory for a specific input, which uses more entries in your sparse input?
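A small CPU sketch of that lazy behaviour, assuming an embedding built with `sparse=True`. The sizes, indices, and learning rate are arbitrary choices for illustration and are not taken from the model in this thread:

```python
import torch

torch.manual_seed(0)
emb = torch.nn.Embedding(10, 3, sparse=True)   # sparse=True yields sparse gradients
opt = torch.optim.SparseAdam(list(emb.parameters()), lr=0.1)

before = emb.weight.detach().clone()
idx = torch.tensor([1, 4])                     # only rows 1 and 4 are looked up

opt.zero_grad()
loss = emb(idx).sum()
loss.backward()
opt.step()

# Rows that never appeared in the gradient should be left untouched.
changed = (emb.weight.detach() != before).any(dim=1)
touched_rows = changed.nonzero().flatten().tolist()
print(touched_rows)
print(emb.weight.grad.is_sparse)
```

Because the work per step depends on how many distinct rows a batch touches, a batch with many distinct indices can need noticeably more temporary memory than a batch with few.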


Hi @ptrblck Thanks for the reply !
I am not sure that I am running out of memory on a specific case, since I tried various training batch sizes that update different numbers of parameters. Besides, the error is always thrown towards the end of the first epoch, never at the beginning. I also observed that memory keeps increasing (not monotonically, but as a trend) when using the sparse optimizer. Anyway, I will keep investigating.
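One way to track a trend like this is to log the allocator statistics once per batch. The helper below is a hypothetical sketch, not code from the original script; it relies on `torch.cuda.memory_allocated`, `torch.cuda.max_memory_allocated`, and `torch.cuda.memory_reserved`, and returns `None` on a machine without CUDA:

```python
import torch

MIB = 1024 ** 2


def cuda_memory_report(device=None):
    """Return allocator statistics in MiB, or None when CUDA is unavailable."""
    if not torch.cuda.is_available():
        return None
    return {
        "allocated": torch.cuda.memory_allocated(device) / MIB,
        "max_allocated": torch.cuda.max_memory_allocated(device) / MIB,
        "reserved": torch.cuda.memory_reserved(device) / MIB,
    }


report = cuda_memory_report()
if report is None:
    print("CUDA not available; nothing to report")
else:
    for name, value in report.items():
        print(f"{name}: {value:.4f} MiB")
```

Calling the helper at the end of each iteration and comparing the `allocated` values across batches shows whether memory grows steadily or only spikes on particular inputs.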

Could you post a code snippet showing this behavior?
How did you define eval_test?
Could you check that all tensors, which you are appending to a list, are properly detached from the computation graph?
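A short CPU sketch of the difference between keeping a loss tensor and keeping a detached value. The linear model and random data are assumptions for illustration only:

```python
import torch

model = torch.nn.Linear(4, 1)      # hypothetical toy model
criterion = torch.nn.L1Loss()
x, target = torch.randn(8, 4), torch.randn(8, 1)

loss = criterion(model(x), target)

attached = loss             # still references the whole autograd graph
as_float = loss.item()      # plain Python float, carries no graph
detached = loss.detach()    # tensor sharing the same data, carries no graph

print(attached.grad_fn is not None)
print(type(as_float).__name__)
print(detached.grad_fn is None, detached.requires_grad)
```

Appending `attached` to a list every batch keeps each batch's graph and its intermediate activations alive, whereas appending `as_float` or `detached` does not.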


def eval_test(model, test_dataloader, criterion, device):
    """Evaluation model on test set"""
    loss = []
    with torch.no_grad():
        for _dict in test_dataloader:
            _dict = {key: var.to(device) for key, var in _dict.items()}
            preds = model(_dict['cont'], _dict['cat'])
            loss.append(criterion(preds, _dict['target'], yr_weights=_dict['wts']).item())
    return np.mean(loss)

I believe .item() detaches it. Also, I am using an AWS cloud instance; could this be caused by a version incompatibility? I am really puzzled by this bug.

Are you only seeing the increase in memory with SparseAdam or also with other optimizers?
I can’t see any obvious errors in your code.

@ptrblck I am not aware of any sparse optimizer other than SparseAdam. I tried RMSprop, but the same error persists.

@bibinmjose Did you find the cause of the error?