How to split a pretrained model for Model Parallelism?

Hi,

In the DDP tutorial (https://pytorch.org/tutorials/intermediate/ddp_tutorial.html), the following code is shown to split a model onto two GPUs:

import torch
import torch.nn as nn

class ToyMpModel(nn.Module):
    def __init__(self, dev0, dev1):
        super(ToyMpModel, self).__init__()
        self.dev0 = dev0
        self.dev1 = dev1
        self.net1 = torch.nn.Linear(10, 10).to(dev0)
        self.relu = torch.nn.ReLU()
        self.net2 = torch.nn.Linear(10, 5).to(dev1)

    def forward(self, x):
        x = x.to(self.dev0)
        x = self.relu(self.net1(x))
        x = x.to(self.dev1)
        return self.net2(x)

How can I split a pretrained model (DeeplabV3Resnet101) onto different GPUs?


import torch.nn as nn
import torchvision.models as models

def getDeepLabV3Resnet101Pretrained(num_of_classes):
    model = models.segmentation.deeplabv3_resnet101(pretrained=True)

    # Change the number of output classes
    model.classifier[4] = nn.Conv2d(
        in_channels=256,
        out_channels=num_of_classes,
        kernel_size=1,
        stride=1
    )

    # And now how to put different model parts on different GPUs?
    # Does model.children() help?
    return model

How would you determine where to split?
I would try to calculate the number of parameters for every model layer and then make more or less equal splits.

Would this be a good way?
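
Roughly what I have in mind (just a sketch, reusing the torchvision model from above):

import torchvision.models as models

model = models.segmentation.deeplabv3_resnet101(pretrained=True)

# Count parameters per top-level block to see where an even split would fall
for name, child in model.named_children():
    n_params = sum(p.numel() for p in child.parameters())
    print(name, n_params)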

Thanks!

I think you will need to manually place different layers on different GPUs. After that you will need to adapt your forward function (similar to the ToyMpModel example you referenced): send the input batch to the first GPU, run it through all of the layers on that GPU, then send the resulting activations to the next GPU, and so on until the last layer on the last GPU.

We currently don’t provide an automated way of splitting the model optimally across GPUs, but the approach you mentioned should work. In essence, I would compute the number of parameters in the model and try to create equal splits so that each GPU gets a roughly similar number of parameters.
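
As a starting point, here is a rough sketch of what manual placement could look like for DeepLabV3-ResNet101, assuming two GPUs and splitting between the backbone and the classifier head (the split point and device ids are just placeholders; for a more balanced split you could move part of the backbone instead):

import torch
import torch.nn as nn
import torchvision.models as models

class SplitDeepLab(nn.Module):
    def __init__(self, num_of_classes, dev0, dev1):
        super(SplitDeepLab, self).__init__()
        self.dev0 = dev0
        self.dev1 = dev1
        pretrained = models.segmentation.deeplabv3_resnet101(pretrained=True)
        # Replace the last classifier layer for the new number of classes
        pretrained.classifier[4] = nn.Conv2d(256, num_of_classes, kernel_size=1, stride=1)
        # Heavy ResNet-101 backbone on the first GPU ...
        self.backbone = pretrained.backbone.to(dev0)
        # ... ASPP classifier head on the second GPU
        self.classifier = pretrained.classifier.to(dev1)

    def forward(self, x):
        input_shape = x.shape[-2:]
        features = self.backbone(x.to(self.dev0))["out"]  # the backbone returns a dict of feature maps
        features = features.to(self.dev1)                 # move activations across GPUs
        out = self.classifier(features)
        # Upsample to the input resolution, as the original torchvision forward does
        return nn.functional.interpolate(out, size=input_shape, mode="bilinear", align_corners=False)

Note that the output (and therefore the targets you compute the loss against) will end up on dev1.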

Thanks for your answer!
This basically means there is no easy way and I would need to modify a copy of the following code, right?

This uses nn.Sequential.
Do I also need to change this, or does “.to” work with nn.Sequential (no separate forward function) as well?

Thanks!

Do I also need to change this, or does “.to” work with nn.Sequential (no separate forward function) as well?

“.to” works on nn.Sequential, but you still need to modify the forward function: once execution of the module on GPU0 has completed, its output will be on GPU0. Since the next module you want to execute is on GPU1, you need to move that output from GPU0 to GPU1 manually (using “.to”) and then execute the module on GPU1.
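
For example, a minimal sketch with two nn.Sequential chunks (layer sizes and device ids are just placeholders):

import torch
import torch.nn as nn

class SplitSequential(nn.Module):
    def __init__(self, dev0, dev1):
        super(SplitSequential, self).__init__()
        self.dev0 = dev0
        self.dev1 = dev1
        # .to() moves every layer inside the nn.Sequential to the given device
        self.part1 = nn.Sequential(nn.Linear(10, 10), nn.ReLU()).to(dev0)
        self.part2 = nn.Sequential(nn.Linear(10, 5)).to(dev1)

    def forward(self, x):
        x = self.part1(x.to(self.dev0))
        # The output of part1 lives on dev0, so move it to dev1 before running part2
        return self.part2(x.to(self.dev1))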