Interesting Issue with Automatic GPU Garbage Collection

Hello!

I recently found something interesting while trying to train a Mask R-CNN. For those unfamiliar with this type of model, it takes 2 inputs:

  1. A list of images (variable size)
  2. A list of dictionaries (where each dictionary contains 3 tensors: bounding boxes, segmentation, and class label)

More information regarding the input data format can be found in the documentation.

The output is a dictionary of losses (classification loss, segmentation loss, etc.); there are 5 keys/losses in total. I'm currently summing the losses into one tensor and then calling backward on the summed total loss (the sum operation is captured in the autograd graph).
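
To make the input and output formats concrete, here's a rough sketch of a single training-mode call (using torchvision's maskrcnn_resnet50_fpn; the shapes and values below are purely illustrative, not my real data):

import torch
import torchvision

# Tiny illustrative setup: 2 classes (background + one object class).
model = torchvision.models.detection.maskrcnn_resnet50_fpn(num_classes=2)
model.train()

# Input 1: a list of image tensors (they may have different heights/widths).
images = [torch.rand(3, 300, 400), torch.rand(3, 500, 350)]

# Input 2: one target dict per image, with boxes, labels, and masks.
targets = [
    {"boxes": torch.tensor([[10., 20., 100., 150.]]),        # (N, 4), xyxy
     "labels": torch.tensor([1]),                             # (N,)
     "masks": torch.zeros(1, 300, 400, dtype=torch.uint8)},   # (N, H, W)
    {"boxes": torch.tensor([[30., 40., 200., 300.]]),
     "labels": torch.tensor([1]),
     "masks": torch.zeros(1, 500, 350, dtype=torch.uint8)},
]

# In training mode the model returns a dict with one entry per loss (5 in my case),
# which I then sum into a single scalar before calling backward().
loss_dict = model(images, targets)
tot_loss = sum(loss for loss in loss_dict.values())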

Anyway, here is my problem that I ran into:

Below is some code for a simple training epoch:

for i, data in enumerate(someDataLoader):
    # get the inputs; data is a list of [inputs, labels]
    images, targets = data

    scenes = [image.to(device) for image in images]
    for t in targets:
        t['boxes'] = t['boxes'].to(device)
        t['labels'] = t['labels'].to(device)
        t['masks'] = t['masks'].to(device)

    # forward + backward + optimize
    loss_dict = maskrcnn_model(scenes, targets)
    
    tot_loss = sum(loss for loss in loss_dict.values())
    
    # zero the parameter gradients
    maskrcnn_opt.zero_grad()
    tot_loss.backward()
    maskrcnn_opt.step()

When running this training loop, I noticed that GPU memory usage keeps increasing as more mini-batches are fed in (for example, memory usage is around 3 GB at the start of training, and after 250 mini-batches the GPU is at 12 GB, steadily increasing by about 300 MB every 2 seconds). The batch size is fairly small.
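
One quick way to watch this from inside the loop (rather than eyeballing nvidia-smi) is something like the following, assuming device is your CUDA device:

# Optional per-iteration logging: memory_allocated reports tensors currently
# allocated on the device, memory_reserved the total cached by the allocator.
alloc = torch.cuda.memory_allocated(device) / 1024**2
reserved = torch.cuda.memory_reserved(device) / 1024**2
print(f"allocated: {alloc:.0f} MiB | reserved: {reserved:.0f} MiB")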

To fix this, I looked at some online examples of training a Mask R-CNN model. Everything was the same as my code except this portion of the training loop:

for t in targets:
    t['boxes'] = t['boxes'].to(device)
    t['labels'] = t['labels'].to(device)
    t['masks'] = t['masks'].to(device)

Instead of doing it the way I did above, they basically made a new dictionary for each target and saved the new dictionaries into a separate list. Either of the following code snippets does this (I find the first one more readable, but they both do the same thing):

# Variant 1: explicit loop
newTargs = []
for t in targets:
    x = {}
    x['boxes'] = t['boxes'].to(device)
    x['labels'] = t['labels'].to(device)
    x['masks'] = t['masks'].to(device)
    newTargs.append(x)

# Variant 2: dict comprehension inside a list comprehension
newTargs = [{k: v.to(device) for k, v in t.items()} for t in targets]

After making a new list of dictionaries, newTargs is then used in the forward pass of the model, rather than targets.

E.g. loss_dict = maskrcnn_model(images, newTargs).

This small change in code keeps GPU memory usage at a steady 3 GB throughout the training epoch.

Any thoughts on why?

Thanks,
Epoching

In your first case, you actually modify targets in place, and targets is exactly what the DataLoader returned. It is very possible that you are incrementally changing your whole Dataset by moving each target to the GPU one by one.
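
A minimal sketch of what I mean (toy names, assuming __getitem__ hands back the very dict the Dataset stores and the collate_fn just passes it through; needs a GPU to run):

import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    def __init__(self):
        # The dataset keeps its target dicts around between epochs.
        self.targets = [{"boxes": torch.zeros(1, 4)} for _ in range(4)]

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, idx):
        # Returns the *same* dict object that the dataset stores.
        return torch.zeros(3, 8, 8), self.targets[idx]

ds = ToyDataset()
# num_workers=0: single-process loading, so the loop sees the dataset's own objects.
loader = DataLoader(ds, batch_size=2, num_workers=0,
                    collate_fn=lambda batch: tuple(zip(*batch)))

for images, targets in loader:
    for t in targets:
        # In-place assignment: this mutates the dict held by the dataset.
        t["boxes"] = t["boxes"].cuda()
    break

print(ds.targets[0]["boxes"].device)  # cuda:0 -> the dataset itself was changed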


Ah interesting, are you thinking that doing this in place keeps each mini-batch on the GPU (because the DataLoader is passing along the original references from the source Dataset)?

If this is the case, why doesn’t this happen for a simple image classification dataloader (made from torchvision.datasets, for example):

images = images.cuda()
labels = labels.cuda()

I also forgot to mention that we have to use a custom collate_fn to get the DataLoader to output batches as a list of image tensors and a list of dictionaries.

Maybe the default collate_fn makes copies of the Dataset examples via torch.stack, so moving the batch to the GPU doesn't alter the device of the original data in the Dataset.

Will definitely look into this suggestion tonight! Thanks for the tip :)

Hi,

Because labels = labels.cuda() does not change the original labels in place; it creates a new tensor and rebinds the name. But your code does change the original dictionary in place! So if this dictionary is the same object as the one stored in your Dataset, then you change the one stored in your Dataset.
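
To make the difference concrete (toy example, needs a GPU):

import torch

labels = torch.zeros(4)
original = labels
labels = labels.cuda()            # creates a new tensor and rebinds the name
print(original.device)            # cpu -> the original tensor is untouched

t = {"boxes": torch.zeros(1, 4)}
shared = t                        # e.g. the dict the Dataset also holds
t["boxes"] = t["boxes"].cuda()    # mutates the dict object itself
print(shared["boxes"].device)     # cuda:0 -> visible through every reference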


This is correct! Below is an example of the collate_fn I was using to batch together examples from the Dataset:

def myCollater(examples):
    input_imgs = []
    target_dicts = []
    for ex in examples:
        input_imgs.append(ex[0])
        target_dicts.append(ex[1])
        
    return torch.stack(input_imgs), target_dicts

So I ended up appending the dictionaries that came directly from the Dataset into a list. The input image tensors didn't stay on the GPU during training, and I think that's because torch.stack() combines multiple image tensors into one new batched tensor (which is a copy of all of the original tensors, so putting the stacked batch on the GPU isn't putting the original image tensors on the GPU).
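
A quick sanity check of that (sketch, needs a GPU):

import torch

imgs = [torch.zeros(3, 8, 8), torch.zeros(3, 8, 8)]
batch = torch.stack(imgs)   # new tensor with its own memory, copied from imgs
batch = batch.cuda()        # moving the stacked batch to the GPU...
print(imgs[0].device)       # cpu -> ...leaves the original tensors where they were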

Also, if we want to train a Mask R-CNN with variable image sizes, then the collate function should return a list of image tensors (rather than a single stacked/batched image tensor, since you can't stack tensors of different sizes). This meant I had to switch the return statement of myCollater to:

return input_imgs, target_dicts

This also means that I now have to put the images on the GPU one at a time (since input_imgs is a list of image tensors), so in the training loop we now have:

for j in range(len(images)):
    images[j] = images[j].to(device)

instead of:

images = images.to(device)

I thought that this would now cause the images to stay on the GPU, since they are supplied directly from the Dataset class and appended to a list inside the myCollater function. It turns out that these images don't stay on the GPU, and everything runs fine.

I ran a small experiment to figure out why. It turns out that appending a tensor to a list does not copy it; the list just holds another reference to the same tensor. The reason the images still don't stay on the GPU is that images[j] = images[j].to(device) creates a new tensor and only rebinds that slot of the list built by myCollater; the original CPU tensor held by the Dataset is never touched. Assigning into the dictionary values, on the other hand, mutates the very dict object that the Dataset stores.
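
Here's the little experiment, in case anyone is curious (needs a GPU):

import torch

t = torch.zeros(2)
lst = [t]
print(lst[0] is t)            # True -> appending stores a reference, not a copy

lst[0] = lst[0].cuda()        # rebinds the list slot; t itself is untouched
print(t.device)               # cpu

d = {"x": torch.zeros(2)}
shared = d                    # e.g. the dict stored in the Dataset
d["x"] = d["x"].cuda()        # assigns into the dict -> mutates it in place
print(shared["x"].device)     # cuda:0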

So my GPU memory problem was caused by reusing the original dictionaries of tensors and modifying them in place. Appending dictionaries (or tensors) to a list doesn't copy the objects for you automatically, which is why we have to build new dictionaries ourselves explicitly in the training loop (or in the collate_fn). Oops!

Thanks for the info again :)