PyTorch GPU memory increases after the first batch and is not released

YONG_ZHANG · June 4, 2019, 7:18pm

Hi there,

I have come across a strange GPU memory issue when training my model. GPU memory increases dramatically after loss.backward() and is never released, and then a “CUDA out of memory” error comes up. It feels like a memory leak, but I don’t know where the problem is.

I’d greatly appreciate your help. Thanks in advance!

My EC2 environment:

OS:  linux
Python:  3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56)
[GCC 7.2.0]
PyTorch:  1.2.0.dev20190603
Numpy:  1.15.4
GPU:  ['Tesla V100-SXM2-16GB']
CUDA Version 9.0.176 (Patch Versions 9.0.176.1, 9.0.176.2, 9.0.176.3, 9.0.176.4)
CuDNN Version  7.3.1

Error log

I added some CUDA memory tracking. We can see that CUDA memory increases after the backpropagation step, especially the cached memory. The memory is mainly taken up by the embedding layer, which embeds items into vectors. I tried manually deleting all the tensors and clearing the cached memory. The cached memory seems to be cleared successfully, but something still resides on the GPU.

Now starting model fitting.

#items:      12879226

#Customers: 4407496

Model loaded

Before forward pass - Cuda memory allocated: 3.297393664

Before forward pass - Cuda memory cached: 3.300917248

After forward pass - Cuda memory allocated: 3.304946688

After forward pass - Cuda memory cached: 3.309305856

After backprop - Cuda memory allocated: 6.59643904

After backprop - Cuda memory cached: 13.207863296

After manually collecting garbage - Cuda memory allocated: 6.59593728

After manually collecting garbage - Cuda memory cached: 6.603931648

Before forward pass - Cuda memory allocated: 6.59593728

Before forward pass - Cuda memory cached: 6.603931648

After forward pass - Cuda memory allocated: 6.602330112

After forward pass - Cuda memory cached: 6.608125952

Traceback (most recent call last):

  File "/home/ubuntu/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main

    "__main__", mod_spec)

  File "/home/ubuntu/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code

    exec(code, run_globals)

  File "PBRS_Models/__main__.py", line 37, in <module>

    args["func"](package_parameters)

  File "PBRS_Models/src/temporal_cnn/main.py", line 169, in main

    args["func"](args)

  File "PBRS_Models/src/temporal_cnn/main.py", line 362, in parse_train_args

    loading_procs

  File "PBRS_Models/src/temporal_cnn/train.py", line 76, in main

    model_fitting.main(dir_processed_data, dir_results, device, is_resume, **config_params)

  File "PBRS_Models/src/temporal_cnn/model_fitting.py", line 268, in main

    model, train_data, optimiser, epoch, device, **config_params

  File "PBRS_Models/src/temporal_cnn/model_fitting.py", line 154, in train_epoch

    loss.backward()

  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 118, in backward

    torch.autograd.backward(self, gradient, retain_graph, create_graph)

  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward

    allow_unreachable=True)  # allow_unreachable flag

RuntimeError: CUDA out of memory. Tried to allocate 3.07 GiB (GPU 0; 15.75 GiB total capacity; 12.29 GiB already allocated; 1.19 GiB free; 9.19 MiB cached)
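A rough back-of-the-envelope check on the numbers in the log above. The embedding width of 64 and the float32 dtype are assumptions (neither is stated in the post); they are used here only because they happen to reproduce the logged figures closely.

```python
# Assumed, not stated in the post: 64-dimensional float32 item embeddings.
NUM_ITEMS = 12_879_226       # "#items" from the log above
EMBED_DIM = 64               # assumption
BYTES_PER_FLOAT32 = 4        # assumption: float32 weights


def embedding_bytes(num_rows, dim, bytes_per_elem=BYTES_PER_FLOAT32):
    """Size in bytes of one dense (num_rows x dim) table."""
    return num_rows * dim * bytes_per_elem


table_bytes = embedding_bytes(NUM_ITEMS, EMBED_DIM)
# The log prints bytes / 1e9 (GB); the OOM message reports GiB (bytes / 2**30).
print(f"one dense copy: {table_bytes / 1e9:.3f} GB = {table_bytes / 2**30:.2f} GiB")
# prints "one dense copy: 3.297 GB = 3.07 GiB"
```

Under these assumptions, one dense copy of the item table is about 3.297 GB, close to the amount allocated before the first forward pass, and about 3.07 GiB, the size of the allocation that failed. The jump to roughly 6.596 GB after the first backward pass would then be consistent with a dense gradient the same size as the table staying allocated between batches.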

Code snippet

data_source is a generator used to load data. I am using the SGD optimizer and the CrossEntropyLoss criterion.

def train_epoch(model, data_source, optimiser, epoch, device, **config_params):
    """Function to train one epoch.
 
    :param model:          The pytorch model to be trained
    :type model:           torch.nn.module
    :param data_source:    Training data generator
    :type data_source:     generator
    :param optimiser:      The pytorch optimiser used to optimise model loss function 
    :type optimiser:       torch.optimiser
    :param epoch:          Epoch step
    :type epoch:           int
    :param device:         The device to load the model onto (CPU or CUDA device).
    :type device:          torch.device
    :param config_params:  Other configuration parameters used to control model training.
    :type config_params:   dict of str->Object
    """
 
    model.train()
    total_loss = 0
    start_time = time.time()
 
    for batch_idx in range(1, data_source.num_batches+1):
        # Calculate the loss and run the backpropagation.
        batch = next(data_source)
        batch = batch.to(device)
        # Variable has been a no-op wrapper since PyTorch 0.4; plain tensor slices suffice.
        inputs, targets = batch[:, :-1], batch[:, 1:]
        optimiser.zero_grad()
        print(f'Before forward pass - Cuda memory allocated: {torch.cuda.memory_allocated()/1e9}')
        print(f'Before forward pass - Cuda memory cached: {torch.cuda.memory_cached()/1e9}')
        logits, new_targets = model(inputs, targets)
        print(f'After forward pass - Cuda memory allocated: {torch.cuda.memory_allocated()/1e9}')
        print(f'After forward pass - Cuda memory cached: {torch.cuda.memory_cached()/1e9}')
        loss = criterion(logits.view(-1, config_params["softmax_nsampled"]+1), new_targets)
        loss.backward()
        print(f'After backprop - Cuda memory allocated: {torch.cuda.memory_allocated()/1e9}')
        print(f'After backprop - Cuda memory cached: {torch.cuda.memory_cached()/1e9}')
        if config_params["clip"] > 0:
            nn.utils.clip_grad_norm_(model.parameters(), config_params["clip"])
        optimiser.step()
        total_loss += loss.item()
        del logits, new_targets, loss, inputs, targets
        torch.cuda.empty_cache()
        print(f'After manually collecting garbage - Cuda memory allocated: {torch.cuda.memory_allocated()/1e9}')
        print(f'After manually collecting garbage - Cuda memory cached: {torch.cuda.memory_cached()/1e9}')
 
        log_interval = config_params["log_interval"]
        if batch_idx % log_interval == 0:
            cur_loss = total_loss/log_interval
            elapsed = time.time() - start_time
            LOGGER.info('| Epoch: {:3d} | {:5d}/{:5d} batches | lr {:02.5f} | ms/batch {:5.5f} | '
                  'loss {:5.2f} |'.format(
                epoch, batch_idx, data_source.num_batches, optimiser.param_groups[0]['lr'],
                elapsed * 1000 / log_interval, cur_loss))
            total_loss = 0
            start_time = time.time()
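The repeated print statements in train_epoch could be folded into one small helper. This is only a sketch: the names cuda_memory_gb and log_cuda_memory are made up for illustration, and the fallback assumes that torch.cuda.memory_reserved() is the newer name for memory_cached() in later PyTorch releases.

```python
import torch


def cuda_memory_gb(device=None):
    """Return (allocated, cached) CUDA memory in GB, or (0.0, 0.0) without CUDA."""
    if not torch.cuda.is_available():
        return 0.0, 0.0
    allocated = torch.cuda.memory_allocated(device) / 1e9
    # Prefer memory_reserved() where it exists; fall back to memory_cached()
    # on older builds such as the 1.2.0 dev version used in this thread.
    cached_fn = getattr(torch.cuda, "memory_reserved", None) or torch.cuda.memory_cached
    return allocated, cached_fn(device) / 1e9


def log_cuda_memory(tag, device=None):
    """Print one line with both counters and return it (hypothetical helper)."""
    allocated, cached = cuda_memory_gb(device)
    line = f"{tag} - Cuda memory allocated: {allocated} | cached: {cached}"
    print(line)
    return line


log_cuda_memory("Before forward pass")
```

Calling log_cuda_memory("After backprop") right after loss.backward() would give the same information as the two print calls in the loop.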

YONG_ZHANG · June 6, 2019, 6:16pm

After looking carefully into my code, I found that I was referring to the embedding layer's weights in some other place. After fixing that, the memory looks stable now.
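The post does not say what the stray reference looked like. As one guess at the general pattern, the sketch below keeps a full-size tensor derived from the embedding weights in a long-lived container; the cache dict, the tensor_bytes helper, and the tiny table size are all hypothetical and run on CPU.

```python
import torch
import torch.nn as nn


def tensor_bytes(t):
    """Bytes occupied by a tensor's elements."""
    return t.element_size() * t.nelement()


emb = nn.Embedding(1000, 16)  # tiny stand-in for the real item table

# Hypothetical stray reference: a derived, full-size tensor stored somewhere
# that outlives the batch. It is a second copy of the table and is still
# attached to the autograd graph.
cache = {"scaled": emb.weight * 2.0}

same_size = tensor_bytes(cache["scaled"]) == tensor_bytes(emb.weight)
has_graph = cache["scaled"].grad_fn is not None
print(same_size, has_graph)

# Dropping the reference is what lets the allocator reuse that memory.
cache.clear()
```

With an embedding table of several gigabytes, each such lingering copy would cost as much memory as the table itself, so del and torch.cuda.empty_cache() on the batch tensors alone would not reclaim it.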

Tanmay_Jaiswal · October 21, 2019, 10:45am

I am facing a similar issue. Can you elaborate on what was happening and how you fixed it?

ashesh-0 · May 12, 2020, 7:08am

I had a similar issue. In my case, the problem was in how I aggregated the loss: I was not calling item() to get the scalar loss and was instead saving the loss tensor itself.
Thanks!
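A minimal sketch of the difference described above, on CPU with a toy model (the model, shapes, and variable names are made up for illustration). Summing loss tensors keeps the autograd history of every batch reachable, while summing loss.item() stores only Python floats.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
criterion = nn.CrossEntropyLoss()

total_as_tensor = 0
total_as_float = 0.0
for _ in range(3):
    inputs = torch.randn(8, 4)
    targets = torch.randint(0, 2, (8,))
    loss = criterion(model(inputs), targets)

    # Problem pattern: the running total is a tensor whose grad_fn chains
    # back through every batch's loss.
    total_as_tensor = total_as_tensor + loss

    # Fix: item() copies the scalar out as a plain Python float.
    total_as_float += loss.item()

print(type(total_as_tensor).__name__, total_as_tensor.grad_fn is not None)
print(type(total_as_float).__name__)
```

The train_epoch function in the first post already uses total_loss += loss.item(), so this particular pattern was not the cause there.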