Memory leaks in PyTorch object detection

I am working through the object detection tutorial on PyTorch. The original tutorial works fine for the few epochs it runs. When I extended it to many more epochs, I ran into an out-of-memory (OOM) error.

I tried to debug it and found something interesting. This is the tool I am using:

import gc

import torch

def debug_gpu():
    # Count every tensor (and every parameter's .data) visible to the garbage collector.
    tensor_list = []
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
                tensor_list.append(obj)
        except Exception:
            # Some objects raise on attribute access; skip them.
            pass
    print(f'Count of tensors = {len(tensor_list)}.')

I used it to monitor memory while training one epoch:

def train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq):
    ...
    for images, targets in metric_logger.log_every(data_loader, print_freq, header):
        # inference + backward + optimization
        debug_gpu()

The output is something like this:

Count of tensors = 414.
Count of tensors = 419.
Count of tensors = 424.
Count of tensors = 429.
Count of tensors = 434.
Count of tensors = 439.
Count of tensors = 439.
Count of tensors = 444.
Count of tensors = 449.
Count of tensors = 449.
Count of tensors = 454.

As you can see, the count of tensors tracked by the garbage collector increases steadily.
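A rough sketch of pairing that tensor count with the CUDA allocator's own counters, to confirm the growth corresponds to real allocations and not just lingering Python references (assumes a CUDA device; the debug_memory name is just for illustration):

import torch

def debug_memory(step):
    # Tensor count from the helper above, plus the allocator's counters.
    debug_gpu()
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1024 ** 2
        reserved = torch.cuda.memory_reserved() / 1024 ** 2
        print(f'step {step}: allocated {allocated:.1f} MiB, reserved {reserved:.1f} MiB')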

The relevant files to run can be found here.

I have two questions:

  1. What is preventing the garbage collector from releasing these tensors?
  2. What should I do about the out-of-memory error?

Hi,

I think we might need some more information before we can understand where the problem is.

  • are you using multiple GPUs to train the model?
  • can you remove the metric_logger logging and just iterate over the dataloader? (Something like the sketch below.)

Those are the two things that came to my mind. I haven't experienced this OOM before, so I'm not sure what else it could be for now.
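For reference, this is roughly what I mean (a sketch based on the tutorial's train_one_epoch signature, not the exact code):

def train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq):
    model.train()
    for images, targets in data_loader:
        # Move the batch to the training device.
        images = list(image.to(device) for image in images)
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

        # Forward pass returns a dict of losses in training mode.
        loss_dict = model(images, targets)
        losses = sum(loss for loss in loss_dict.values())

        optimizer.zero_grad()
        losses.backward()
        optimizer.step()

        debug_gpu()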

Hi,
Thank you for the kind reply.

  1. I used one GPU, but the same thing happens even on CPU;
  2. I did try removing metric_logger, but the leak still happens. I suspect the model itself is causing it.

The original tutorial runs too few epochs to trigger the OOM, but I believe you can easily reproduce my situation with my code.

Hi @fmassa, I think I found the reason: rpn.anchor_generator._cache lives as long as the model does, and the number of tensors it holds keeps growing as training proceeds. I provide a sample for easy reproduction.
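For reference, a rough sketch of how to observe it (this assumes the model exposes the cache as a plain dict at model.rpn.anchor_generator._cache); clearing the cache is only a workaround, not the proper fix:

# Watch the anchor generator's cache grow across iterations.
cache = model.rpn.anchor_generator._cache
print(f'cached anchor entries: {len(cache)}')

# Workaround until a proper fix: drop the cached anchors periodically.
cache.clear()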

BTW, I sent a pull request with the fix, along with a solution for another data-type error, but the codecov check failed. Would you please have a look?

PR was merged. Thanks for the contribution @tengerye :slight_smile: