GPU memory consumption increases while training

Hello, all
I am new to PyTorch and I ran into strange GPU memory behavior while training a CNN model for semantic segmentation. The batch size is 1 and the training set contains 100 image-label pairs, so there are 100 iterations per epoch. However, GPU memory consumption increases a lot over the first several iterations of training.

[Platform] GTX TITAN X (12G), CUDA-7.5, cuDNN-5.0

torch.backends.cudnn.enabled = False
torch.backends.cudnn.benchmark = False

Then GPU memory consumption is 2934M – 4413M – 4433M – 4537M – 4537M – 4537M over the first six iterations.

torch.backends.cudnn.enabled = True
torch.backends.cudnn.benchmark = True

Then GPU memory consumption is 1686M – 1791M – 1791M – 1791M – 1791M – 1791M over the first six iterations.

Why does GPU memory consumption increase during training, and in particular, why does it increase so much without cuDNN? (I expected GPU memory consumption to stay constant once the CNN has been built and training has started.)
Has anyone met the same problem, or could anyone give some help?

This is the code snippet

def train(train_loader, model, criterion, optimizer, epoch):
    batch_time = AverageMeter()
    data_time = AverageMeter()
    losses = AverageMeter()

    # switch to train mode
    model.train()

    end = time.time()
    for i, (input, target) in enumerate(train_loader):
        # measure data loading time
        data_time.update(time.time() - end)

        target = target.long()
        input = input.cuda(async=True)
        target = target.cuda(async=True)
        input_var = torch.autograd.Variable(input)
        target_var = torch.autograd.Variable(target)

        # compute output
        output = model(input_var)
        loss = criterion(output, target_var)

        # record loss
        losses.update(loss.data[0], input.size(0))

        # compute gradient and do SGD step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # measure elapsed time
        batch_time.update(time.time() - end)
        end = time.time()

        if i % args.print_freq == 0:
            print('Epoch: [{0}][{1}/{2}]\t'
                  'Time {batch_time.val:.3f} ({batch_time.avg:.3f})\t'
                  'Data {data_time.val:.3f} ({data_time.avg:.3f})\t'
                  'Loss {loss.val:.4f} ({loss.avg:.4f})'.format(
                   epoch, i+1, len(train_loader),
                   batch_time=batch_time,
                   data_time=data_time,
                   loss=losses))

@apaszke @smth Would you please give some advice on this problem? You seem to have a good knowledge of PyTorch. Thanks very much!

If you add del loss, output at the end of the loop, the memory usage will likely remain the same after the first iteration (what you see is probably a side effect of Python’s scoping rules). It’s possible that cuDNN uses much less memory than the default backend.
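A minimal, self-contained sketch of the suggested change; the toy model, criterion, and data below are hypothetical stand-ins, not the original training setup:

```python
import torch
import torch.nn as nn

# Toy stand-ins; any model/criterion/optimizer shows the same effect.
model = nn.Linear(8, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data = [(torch.randn(4, 8), torch.randint(0, 2, (4,))) for _ in range(3)]

for i, (input, target) in enumerate(data):
    output = model(input)
    loss = criterion(output, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Without this, `loss` and `output` stay in scope until they are
    # reassigned in the next iteration, keeping the previous graph (and
    # its intermediate activations) alive during the next forward pass.
    del loss, output
```

The key point is only the final `del`: it releases the last references to the graph before the next forward pass allocates a new one.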


Wow, it really works. Thank you very much. I still have so much to learn about Python and PyTorch. :smiley:

When running a tiramisu model, I found that GPU usage was 4.5 GB during the 1st epoch and shot up during the 2nd. Based on your comment I did the following:

out = model(input)
loss = crite(out, labels)

loss.backward()
del loss
del out

and thankfully my network used the same amount of GPU memory in the subsequent epochs.

Oh, it really helps. Thanks!

Hi @apaszke, I tried your solution but it doesn’t solve my problem. See here about my situation. Really appreciate your help, thanks!

Answered in the thread.

Hello,

I have the same problem: my GPU memory usage increases in the nvidia-smi output while torch.cuda.memory_allocated('cuda:0') always reports the same value.

I already use del loss and detach_() some tensors after the backward() call.

Still, I never get an OOM error, and my training always reaches its end.

Thank you

Could you post a code snippet to reproduce this issue, please?
Based on the description it seems as if memory is really leaked (not just increased due to e.g. storing the computation graph).

if use_gpu:
    if torch.cuda.is_available():
        torch.backends.cudnn.enabled = True
        torch.backends.cudnn.benchmark = True
        retinanet = retinanet.cuda()

if torch.cuda.is_available():
    retinanet = torch.nn.DataParallel(retinanet).cuda()
else:
    retinanet = torch.nn.DataParallel(retinanet)

for epoch_num in range(parser.epochs):

    train_loss = train(dataloader_train, retinanet, optimizer, writer, epoch_num, train_hist)

    val_loss = eval(dataloader_val, retinanet, writer, epoch_num, val_hist)

    AP_eval = csv_eval.evaluation(dataset_val, retinanet)

def train(dataloader_train, model, optimizer, writer, epoch, train_hist):

    print("Train")

    model.train()

    model.module.set_compute(True)

    epoch_loss = []

    for iter_num, data in enumerate(dataloader_train):
        try:
            optimizer.zero_grad()

            if torch.cuda.is_available():
                classification_loss, regression_loss = model([data['img'].cuda().float(), data['annot']])
            else:
                classification_loss, regression_loss = model([data['img'].float(), data['annot']])

            classification_loss = classification_loss.mean()
            regression_loss = regression_loss.mean()
        
            loss = classification_loss + regression_loss

            if bool(loss == 0):
                continue

            loss.backward()

            torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)

            optimizer.step()

            classification_loss.detach_()
            regression_loss.detach_()
            loss.detach_()

            if parser.debug != 1:
                writer.add_scalar("ClassLoss/Train", classification_loss.cpu(), (iter_num + len(dataloader_train) * epoch))
                writer.add_scalar("RegLoss/Train", regression_loss.cpu(), (iter_num + len(dataloader_train) * epoch))
                writer.add_scalar("Total/Train", loss.cpu(), (iter_num + len(dataloader_train) * epoch))

            train_hist.append(float(loss.cpu()))

            epoch_loss.append(float(loss.cpu()))

            if iter_num % int(len(dataloader_train)/10) == 0:

                print(
                'Epoch: {:3d}/{:3d} | Iteration: {:4d}/{:4d} | Classification loss: {:1.5f} | Regression loss: {:1.5f} | Epoch loss: {:1.5f} | Average loss: {:1.5f}'.format(
                    epoch, parser.epochs,  iter_num, len(dataloader_train), float(classification_loss.cpu()), float(regression_loss.cpu()), np.mean(epoch_loss), np.mean(train_hist)))

            del classification_loss
            del regression_loss
            del loss
        except Exception as e:
            print(e)
            continue

    if parser.debug != 1:
        writer.add_scalar("EpochLoss/Train", np.mean(epoch_loss), epoch)

    return np.mean(epoch_loss)

def eval(dataloader_val, model, writer, epoch, val_hist):

    print("Eval")

    model.eval()
    model.module.freeze_bn()
    model.module.set_compute(True)

    epoch_loss = []

    with torch.no_grad():
        for iter_num, data in enumerate(dataloader_val):
            try:
                if torch.cuda.is_available():
                    classification_loss, regression_loss = model([data['img'].cuda().float(), data['annot']])
                else:
                    classification_loss, regression_loss = model([data['img'].float(), data['annot']])

                classification_loss = classification_loss.mean().cpu()
                regression_loss = regression_loss.mean().cpu()

                loss = classification_loss + regression_loss

                if bool(loss == 0):
                    continue

                if parser.debug != 1:
                    writer.add_scalar("ClassLoss/Eval", classification_loss, (iter_num + len(dataloader_val) * epoch))
                    writer.add_scalar("RegLoss/Eval", regression_loss, (iter_num + len(dataloader_val) * epoch))
                    writer.add_scalar("Total/Eval", loss, (iter_num + len(dataloader_val) * epoch))

                val_hist.append(float(loss))
                epoch_loss.append(float(loss))

                if iter_num % int(len(dataloader_val)/10) == 0:

                    print('Epoch: {:3d}/{:3d} | Iteration: {:4d}/{:4d} | Classification loss: {:1.5f} | Regression loss: {:1.5f} | Epoch loss: {:1.5f} | Average loss: {:1.5f}'.format(epoch, parser.epochs,  iter_num, len(dataloader_val), float(classification_loss), float(regression_loss), np.mean(epoch_loss), np.mean(val_hist)))
                
                del classification_loss
                del regression_loss
                del loss

            except Exception as e:
                print(e)
                continue

    if parser.debug != 1:
        writer.add_scalar("EpochLoss/Eval", np.mean(epoch_loss), epoch)

    return np.mean(epoch_loss)

Here is my code snippet.

csv_eval.evaluation(dataset_val, retinanet)

is just inference to compute the mAP.

What I have observed is that during the first training epoch the nvidia-smi memory usage increases a bit, drops to a lower value during the eval epoch, and to a very low one during inference (normal behaviour). During the second training epoch the memory still increases, but surprisingly, during eval and inference the memory usage stays at the same level as during training.

After 2-3 epochs it stays around the same value regardless of the train/eval/inference phase.

In that case I might have misunderstood the issue and thought the memory would increase in each epoch.
If you are seeing some memory increase in the first epochs, this might be due to memory fragmentation. E.g. if you are finishing the first training and validation run, PyTorch might free some intermediate tensors, which are not referenced anymore. The next memory allocation might not fit in the freed blocks, so that new memory has to be allocated.

How large is the memory increase after the first, second and third epoch?

I just made a test now.
Epoch 1:
train iter 1: 6139M
train iter last: 7209M
eval iter 1: 3291M
eval iter last: 3291M
inference iter 1: 3291M
inference iter last: 3291M
Epoch 2:
train iter 1: 6409M
train iter last: 7477M
eval iter 1: 7479M
eval iter last: 7479M
inference iter 1: 7479M
inference iter last: 7479M
Epoch 3:
train iter 1: 7479M
train iter last: 7479M
eval iter 1: 7479M
eval iter last: 7479M
inference iter 1: 7479M
inference iter last: 7479M

Maybe this is normal, in which case I learned something new. My guess is that there is some “knowledge” of the train > eval > inference pattern and the GPU keeps the computational graph in memory for the next epochs, even inside a torch.no_grad() context?

Note that PyTorch tries to reuse the cached memory in order to avoid cudaMalloc calls.
If you are measuring the memory usage via nvidia-smi only, you’ll see the overall used memory (allocated + cached + CUDA context + other processes).

You could check the allocated memory via torch.cuda.memory_allocated(), which would most likely go down during evaluation, if you’ve properly freed the training data.
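For reference, a small sketch of how the two numbers can be compared side by side (note that memory_reserved was named memory_cached in older PyTorch releases; the helper below is a hypothetical name):

```python
import torch

def log_gpu_memory(tag: str) -> None:
    # Bytes currently occupied by live tensors.
    allocated = torch.cuda.memory_allocated() / 1024**2
    # Bytes held by PyTorch's caching allocator; this is roughly what
    # nvidia-smi shows, minus the CUDA context and other processes.
    reserved = torch.cuda.memory_reserved() / 1024**2
    print(f"{tag}: allocated={allocated:.0f} MiB, reserved={reserved:.0f} MiB")

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")
    log_gpu_memory("after alloc")
    del x
    # `allocated` drops here; `reserved` typically stays the same because
    # the freed block is kept in the cache for reuse (avoiding cudaMalloc).
    log_gpu_memory("after del")
```

Logging both values at the start of each train/eval/inference phase makes it easy to tell real leaks (allocated keeps growing) apart from caching (only reserved stays high).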

The behaviour follows what you said. Thank you for the explanation!

I faced a similar issue. It turned out I was collecting validation stats across batches, but this was all happening on the GPU side. I solved the problem by detaching the collected validation data and sending it to the CPU.

I’m sorry for the delayed reply on this thread. I arrived at this issue a couple of moments ago and your answer solved it for me, but I’m not entirely sure why. Can you direct me to some resource where I can read more about this (particularly about the cause)?

GPU memory consumption increases a lot at the first several iterations while training.

I am also facing this issue. I tried the solution

But this did not have a noticeable impact. My training loop is as follows:

def run(self):
    pb = tqdm(
        total=self._n,
        leave=self._keep,
        colour=self._bar_color,
        desc=self._desc
    )
    train_accuracies = 0.0
    train_ious = 0.0
    train_mses = 0.0
    for data_dict in self._generator:
        if self._eval_mode:
            self._model.eval()
        else:
            self._model.train()
        self._optimizer.zero_grad()
        pred = self._model(
            x_a=data_dict['x1'],
            x_b=data_dict['x2'],
        )
        im_acc = label_accuracy(
            prediction=pred['y1'],
            target=data_dict['y1']
        )
        im_iou = iou_score(
            predictions=pred['y1'],
            targets=data_dict['y1']
        )
        pc_mse = pc_mse_metric(
            prediction=pred['x2'],
            target=data_dict['x2']
        )
        train_accuracies += im_acc
        train_ious += im_iou
        train_mses += pc_mse
        if not self._eval_mode:
            loss = self._criterion(
                predictions=pred, targets={
                    'y1': data_dict['y1'],
                    'x2': data_dict['x2']
                }
            )

            loss.backward()
            self._optimizer.step()
        pb.update(n=1)
    k = "Validation" if self._eval_mode else "Training"
    self._metrics.append(
        {
            k: {
                'Accuracy': (train_accuracies / self._n),
                'mIoU': (train_ious / self._n),
                'MSE': (train_mses / self._n)
            }
        }
    )

Due to this problem I am encountering RuntimeError: CUDA out of memory after a few iterations.
What am I doing wrong?

In your code you are accumulating stats in:

        train_accuracies += im_acc
        train_ious += im_iou
        train_mses += pc_mse

which could increase memory usage if some of these tensors are still attached to the computation graph, since the entire graph would then also be stored in each iteration.
Assuming you want to track these statistics without calling backward() on any of these tensors, make sure to .detach() the tensors before adding them or call .item() in case it’s a scalar value.
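A hedged sketch of the difference, using a hypothetical MSE-style metric built from graph-attached tensors:

```python
import torch

model = torch.nn.Linear(8, 1)
pred = model(torch.randn(4, 8))

# This metric tensor is still attached to the graph through `pred`:
mse = ((pred - 1.0) ** 2).mean()
assert mse.requires_grad  # accumulating `mse` itself would keep the graph alive

total = 0.0
# .item() returns a plain Python float with no graph reference;
# .detach() does the same for non-scalar tensors while keeping them tensors.
total += mse.item()
```

Accumulating `total` across iterations then stores only floats, not one computation graph per batch.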
