VGG16 takes up huge GPU memory

I am training a domain adaptation model based on Faster R-CNN, and I am already using model parallelism to train it. While a single RTX 2080 Ti can handle the source training (i.e. training normally on the source domain), I constantly get a CUDA out of memory error when forwarding two images, one from the source and one from the target, at the same time, even with the model split across four 2080 Tis. I found that without splitting, it always ran out of memory in the extractor (VGG16 here), so I split the VGG16 into three parts on three GPUs using nn.Sequential. But I still get CUDA out of memory in the second part:

[Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),
 ReLU(inplace=True),
 Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),
 ReLU(inplace=True),
 MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)]

even if I freeze the parameters of this part and the part before it.
So what is wrong with my use of model parallelism, or anything else?

How large are your inputs?
Freezing the weights by setting their requires_grad attribute to False won't avoid storing the intermediate activations; you should also wrap that part of the code in a with torch.no_grad() block.
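Something like this (frozen and head are placeholder modules for illustration):

import torch
from torch import nn

frozen = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())  # frozen part
head = nn.Linear(8, 2)                                            # trainable part

x = torch.randn(1, 3, 32, 32)
with torch.no_grad():               # no autograd graph is recorded here,
    feat = frozen(x)                # so the intermediate activations can be freed
out = head(feat.mean(dim=(2, 3)))   # the graph starts at feat
out.sum().backward()                # gradients only reach head's parameters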


Oh thanks, I will try torch.no_grad().
During source training the size is 2000 for min_size and 4500 for large_size, and with those values it does not run out of memory. During transfer training, min_size is at most 1200.
My images are drawn from remote sensing images, which are quite large (about 2k x 2k), but the objects are relatively small (about 100 x 100), and there are tiny textures on the objects, like the lines distinguishing a tennis court from a basketball court, so reducing the image size would hurt the performance.

To be more specific, I need to forward two images simultaneously during transfer training; that is why everything is fine during normal training but out of memory during transfer training.

with torch.no_grad() does help, thanks a lot! But I would still like to leave this thread open to see if there are other solutions that do not freeze the weights, since there is little difference in the max memory use of a single card between splitting the extractor into 2 and 3 parts.
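For reference, the peak usage per card can be checked with torch.cuda.max_memory_allocated, e.g.:

import torch

for d in range(torch.cuda.device_count()):
    peak = torch.cuda.max_memory_allocated(d) / 1024 ** 2
    print(f"cuda:{d} peak allocated: {peak:.0f} MiB")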

Just to make sure I understand the issue correctly:
You are feeding two images simultaneously (both quite large at ~2000x2000 pixels).
While training the model from scratch, everything works fine.
If you want to fine-tune a pretrained model, you are running out of memory.
Is this correct, or am I missing something?

Thanks for your reply, let me make it clearer.
I have two stages of training. The first stage feeds 1 image and works fine. The second stage, transfer, needs to feed two images, but runs out of memory.
What confuses me is that a single GPU can handle 1 image and the entire network, but 3 GPUs cannot handle 2 images and only the backbone. And after I split the first 30 layers of VGG16 across 3 GPUs, the second part, consisting of 5 layers, was where the model ran out of memory, rather than the bigger part 1 or part 3.
One more thing: during transfer (stage 2), I add up the losses over 4 batches (8 images in total) and call loss.backward() once, to gather enough features for my training. I don't know if this has something to do with the issue.
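Simplified, the accumulation loop looks roughly like this (loader, compute_losses, and optimizer stand in for my actual objects):

total_loss = 0
for step, (src_img, tgt_img) in enumerate(loader):
    # sum the per-batch losses; each addition keeps that batch's graph alive
    total_loss = total_loss + compute_losses(src_img, tgt_img)
    if (step + 1) % 4 == 0:          # every 4 batches (8 images)
        optimizer.zero_grad()
        total_loss.backward()        # graphs of all 4 forward passes are held
        optimizer.step()             # in memory until this single call
        total_loss = 0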

The last part of the VGG model contains the linear layers and thus the majority of all parameters.
Could you post the code you are using for model sharding so that I could try to reproduce it, please?

I am using the Faster R-CNN from https://github.com/chenyuntc/simple-faster-rcnn-pytorch, and I modified the __init__ of its base model:

class FasterRCNN(nn.Module):
    """Base class for Faster R-CNN.

    This is a base class for Faster R-CNN links supporting object detection
    API [#]_. The following three stages constitute Faster R-CNN.

    1. **Feature extraction**: Images are taken and their \
        feature maps are calculated.
    2. **Region Proposal Networks**: Given the feature maps calculated in \
        the previous stage, produce a set of RoIs around objects.
    3. **Localization and Classification Heads**: Using feature maps that \
        belong to the proposed RoIs, classify the categories of the objects \
        in the RoIs and improve localizations.

    Each stage is carried out by one of the callable
    :class:`torch.nn.Module` objects :obj:`feature`, :obj:`rpn` and :obj:`head`.

    There are two functions :meth:`predict` and :meth:`__call__` to conduct
    object detection.
    :meth:`predict` takes images and returns bounding boxes that are converted
    to image coordinates. This will be useful for a scenario when
    Faster R-CNN is treated as a black box function, for instance.
    :meth:`__call__` is provided for a scenario when intermediate outputs
    are needed, for instance, for training and debugging.

    Links that support object detection API have method :meth:`predict` with
    the same interface. Please refer to :meth:`predict` for
    further details.

    .. [#] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. \
    Faster R-CNN: Towards Real-Time Object Detection with \
    Region Proposal Networks. NIPS 2015.

    Args:
        extractor (nn.Module): A module that takes a BCHW image
            array and returns feature maps.
        rpn (nn.Module): A module that has the same interface as
            :class:`model.region_proposal_network.RegionProposalNetwork`.
            Please refer to the documentation found there.
        head (nn.Module): A module that takes
            a BCHW variable, RoIs and batch indices for RoIs. This returns class
            dependent localization parameters and class scores.
        loc_normalize_mean (tuple of four floats): Mean values of
            localization estimates.
        loc_normalize_std (tuple of four floats): Standard deviation
            of localization estimates.

    """

    def __init__(self, extractor, rpn, head, freeze,
                 loc_normalize_mean=(0., 0., 0., 0.),
                 loc_normalize_std=(0.1, 0.1, 0.2, 0.2)
    ):
        super(FasterRCNN, self).__init__()
        # ---- I modified here, opt.vgg_slice0 is 13, opt.vgg_slice1 is 17 ----
        self.extractor0 = extractor[:opt.vgg_slice0].to(opt.gpu0)
        self.extractor1 = extractor[opt.vgg_slice0:opt.vgg_slice1].to(opt.gpu1)
        self.extractor2 = extractor[opt.vgg_slice1:].to(opt.gpuv)
        if freeze:
            # disable gradients for the first two extractor parts
            for p in self.extractor0.parameters():
                p.requires_grad = False
            for p in self.extractor1.parameters():
                p.requires_grad = False
        # ---- modification ends ----
        self.rpn = rpn
        self.head = head

        # mean and std
        self.loc_normalize_mean = loc_normalize_mean
        self.loc_normalize_std = loc_normalize_std
        self.use_preset('evaluate')
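The forward chaining of the three parts is omitted above; roughly, it moves the activations between devices like this (simplified sketch):

def _extract(self, x):
    # run each shard on its own GPU, moving activations between devices
    h = self.extractor0(x.to(opt.gpu0))
    h = self.extractor1(h.to(opt.gpu1))
    return self.extractor2(h.to(opt.gpuv))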

and the extractor passed to this is the first return value of this function:

import torch as t
from torch import nn
from torchvision.models import vgg16
from utils.config import opt  # the repo's config object, used below


def decom_vgg16():
    # the 30th layer of features is relu of conv5_3
    if opt.caffe_pretrain:  # caffe_pretrain is False
        model = vgg16(pretrained=False)
        if not opt.load_path:
            model.load_state_dict(t.load(opt.caffe_pretrain_path))
    else:
        model = vgg16(not opt.load_path)

    features = list(model.features)[:30]
    classifier = model.classifier

    classifier = list(classifier)
    del classifier[6]
    if not opt.use_drop:
        del classifier[5]
        del classifier[2]
    classifier = nn.Sequential(*classifier)

    # freeze top4 conv
    for layer in features[:10]:
        for p in layer.parameters():
            p.requires_grad = False

    return nn.Sequential(*features), classifier

Also, I am not using the linear layers in the extractor (the classifier is returned separately), so it should contain far fewer parameters.