How to split a pretrained model for Model Parallelism?

Hi,

In the DDP tutorial (https://pytorch.org/tutorials/intermediate/ddp_tutorial.html), the following code is shown to split a model onto two GPUs:

import torch
import torch.nn as nn

class ToyMpModel(nn.Module):
    def __init__(self, dev0, dev1):
        super(ToyMpModel, self).__init__()
        self.dev0 = dev0
        self.dev1 = dev1
        self.net1 = torch.nn.Linear(10, 10).to(dev0)
        self.relu = torch.nn.ReLU()
        self.net2 = torch.nn.Linear(10, 5).to(dev1)

    def forward(self, x):
        x = x.to(self.dev0)
        x = self.relu(self.net1(x))
        x = x.to(self.dev1)
        return self.net2(x)

How can I split a pretrained model (DeeplabV3Resnet101) onto different GPUs?


import torch.nn as nn
import torchvision.models as models

def getDeepLabV3Resnet101Pretrained(num_of_classes):
    model = models.segmentation.deeplabv3_resnet101(pretrained=True)

    # Change the number of output classes
    model.classifier[4] = nn.Conv2d(
        in_channels=256,
        out_channels=num_of_classes,
        kernel_size=1,
        stride=1
    )

    # And now how to put different model parts on different GPUs?
    # Does model.children() help?
    return model

How would you determine where to split?
I would try to calculate the number of parameters for every model layer and then make more or less equal splits.

Would this be a good way?
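
Roughly what I have in mind (just a sketch, reusing the torchvision model from above):

import torchvision.models as models

model = models.segmentation.deeplabv3_resnet101(pretrained=True)

# Count parameters per top-level block to see where an even split would fall
for name, child in model.named_children():
    n_params = sum(p.numel() for p in child.parameters())
    print(name, n_params)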

Thanks!

I think you will need to manually place different layers on different GPUs. After that you will need to adapt your forward function (similar to the ToyMpModel example you referenced): send the input batch to the first GPU, run it through all of the layers on that GPU, then send the resulting activations to the next GPU, and so on until the last layer on the last GPU.

We currently don’t provide an automated way of splitting the model optimally across GPUs, but the approach you mentioned should work. In essence, I would compute the number of parameters in the model and try to create equal splits so that each GPU gets a roughly similar number of parameters.
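
As a starting point, here is a rough sketch of what manual placement could look like for DeepLabV3-ResNet101, assuming two GPUs and splitting between the backbone and the classifier head (the split point and device ids are just placeholders; for a more balanced split you could move part of the backbone instead):

import torch
import torch.nn as nn
import torchvision.models as models

class SplitDeepLab(nn.Module):
    def __init__(self, num_of_classes, dev0, dev1):
        super(SplitDeepLab, self).__init__()
        self.dev0 = dev0
        self.dev1 = dev1
        pretrained = models.segmentation.deeplabv3_resnet101(pretrained=True)
        # Replace the last classifier layer for the new number of classes
        pretrained.classifier[4] = nn.Conv2d(256, num_of_classes, kernel_size=1, stride=1)
        # Heavy ResNet-101 backbone on the first GPU ...
        self.backbone = pretrained.backbone.to(dev0)
        # ... ASPP classifier head on the second GPU
        self.classifier = pretrained.classifier.to(dev1)

    def forward(self, x):
        input_shape = x.shape[-2:]
        features = self.backbone(x.to(self.dev0))["out"]  # the backbone returns a dict of feature maps
        features = features.to(self.dev1)                 # move activations across GPUs
        out = self.classifier(features)
        # Upsample to the input resolution, as the original torchvision forward does
        return nn.functional.interpolate(out, size=input_shape, mode="bilinear", align_corners=False)

Note that the output (and therefore the targets you compute the loss against) will end up on dev1.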

Thanks for your answer!
This basically means there is no easy way and I would need to modify a copy of the following code, right?

This uses nn.Sequential.
Do I also need to change this, or does “.to” work with nn.Sequential (no separate forward function) as well?

Thanks!

Do I also need to change this, or does “.to” work with nn.Sequential (no separate forward function) as well?

“.to” works on nn.Sequential, but you still need to modify the forward function: once execution of the module on GPU0 has completed, its output will be on GPU0. Since the next module you want to execute is on GPU1, you need to move that output from GPU0 to GPU1 manually (using “.to”) and then execute the module on GPU1.
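
For example, a minimal sketch with two nn.Sequential chunks (layer sizes and device ids are just placeholders):

import torch
import torch.nn as nn

class SplitSequential(nn.Module):
    def __init__(self, dev0, dev1):
        super(SplitSequential, self).__init__()
        self.dev0 = dev0
        self.dev1 = dev1
        # .to() moves every layer inside the nn.Sequential to the given device
        self.part1 = nn.Sequential(nn.Linear(10, 10), nn.ReLU()).to(dev0)
        self.part2 = nn.Sequential(nn.Linear(10, 5)).to(dev1)

    def forward(self, x):
        x = self.part1(x.to(self.dev0))
        # The output of part1 lives on dev0, so move it to dev1 before running part2
        return self.part2(x.to(self.dev1))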