Parallelizing models of different classes over multiple GPUs

I’m interested in learning about good practices for parallelizing PyTorch models over multiple GPUs; more specifically, I have two questions.

My first question concerns nn.DataParallel and a class that inherits from it, DataParallelPassthrough. More specifically, in this work a pre-trained GAN generator, G, which has been set to evaluation mode, is used to generate images. This model is wrapped as:

G = DataParallelPassthrough(G)

where DataParallelPassthrough is defined as follows:

from torch import nn

class DataParallelPassthrough(nn.DataParallel):
    def __getattr__(self, name):
        try:
            return super(DataParallelPassthrough, self).__getattr__(name)
        except AttributeError:
            return getattr(self.module, name)

How is DataParallelPassthrough different from standard nn.DataParallel, and why should one prefer the former over the latter? I have seen it used in some repos, but I couldn’t find a good explanation of why. Could you help me understand?
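To make the question concrete, here is a minimal CPU-only reproduction of the difference I’m seeing (ToyGenerator, dim_z, and sample_z are made-up stand-ins for the real generator’s custom attributes, not the actual code):

```python
import torch
from torch import nn


class DataParallelPassthrough(nn.DataParallel):
    """nn.DataParallel that forwards unknown attributes to the wrapped module."""
    def __getattr__(self, name):
        try:
            return super(DataParallelPassthrough, self).__getattr__(name)
        except AttributeError:
            return getattr(self.module, name)


class ToyGenerator(nn.Module):
    """Toy stand-in for G, with a custom attribute and a custom method."""
    def __init__(self, dim_z=4):
        super().__init__()
        self.dim_z = dim_z
        self.fc = nn.Linear(dim_z, 8)

    def forward(self, z):
        return self.fc(z)

    def sample_z(self, n):
        return torch.randn(n, self.dim_z)


g_plain = nn.DataParallel(ToyGenerator())
g_pass = DataParallelPassthrough(ToyGenerator())

# Plain nn.DataParallel hides the module's custom attributes:
# they are only reachable via g_plain.module.dim_z.
try:
    g_plain.dim_z
except AttributeError:
    print("plain DataParallel: dim_z not found")

# The passthrough forwards the failed lookup to self.module.
print(g_pass.dim_z)              # 4
print(g_pass.sample_z(2).shape)  # torch.Size([2, 4])
```

So, as far as I can tell, the subclass only changes attribute lookup, not how the forward pass is parallelized, but I would like to confirm this.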

Now, besides the aforementioned generator G, there is another model (an instance of a different class) that takes G's generated images as input; you may think of it as a ResNet-like model. However, this model is not wrapped in DataParallelPassthrough, and this causes some issues: for instance, while the whole pipeline fits on a single 32GB Tesla V100, it doesn’t fit on a pair of 16GB V100s.

My second question is how I should parallelize both models. Should I wrap both in nn.DataParallel (or DataParallelPassthrough)? I would like to avoid pinning each model to a specific GPU device (e.g., using .to()).
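Concretely, is something along these lines the right approach? (The two Linear layers below are just toy stand-ins for G and the ResNet-like model; on a CPU-only machine nn.DataParallel simply falls through to the wrapped module.)

```python
import torch
from torch import nn

# Toy stand-ins for the real models (assumptions, not the actual code):
# a "generator" mapping z to a flat image, and a "head" consuming that image.
G = nn.Linear(16, 3 * 8 * 8)
head = nn.Linear(3 * 8 * 8, 10)

# Wrap both models, instead of pinning each one to a specific device.
G = nn.DataParallel(G).eval()
head = nn.DataParallel(head)

z = torch.randn(4, 16)
imgs = G(z)          # batch scattered across GPUs, outputs gathered
logits = head(imgs)  # scattered again for the second model
print(logits.shape)  # torch.Size([4, 10])
```

My worry is whether gathering G's outputs on the default device between the two wrapped models is what causes the memory blow-up on the pair of 16GB cards.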

Thank you!