Sharding model across GPUs

nn.DataParallel replicates a model across GPUs and parallelizes execution by splitting the input over the batch dimension.
This assumes the whole model fits inside a single GPU's memory. Is there a natural way in PyTorch to run a single model across multiple GPUs?
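For context, here is a rough sketch (plain Python only, with threads and ordinary functions standing in for GPUs and model replicas) of what nn.DataParallel does conceptually: scatter the batch into chunks, run a replica on each chunk in parallel, then gather the outputs:

```python
# Conceptual sketch of nn.DataParallel: scatter the batch across
# "devices", run a model replica on each chunk concurrently, gather.
# Threads and a toy function stand in for real GPUs and a real model.
from concurrent.futures import ThreadPoolExecutor

def model(chunk):
    # Stand-in for a model's forward pass: double each element.
    return [2 * x for x in chunk]

def data_parallel_forward(batch, num_devices=2):
    # Scatter: split the batch into one chunk per "device".
    size = (len(batch) + num_devices - 1) // num_devices
    chunks = [batch[i:i + size] for i in range(0, len(batch), size)]
    # Parallel apply: each replica processes its chunk concurrently.
    with ThreadPoolExecutor(max_workers=num_devices) as pool:
        results = list(pool.map(model, chunks))
    # Gather: concatenate per-device outputs back into one batch.
    return [y for chunk_out in results for y in chunk_out]

print(data_parallel_forward([1, 2, 3, 4]))  # [2, 4, 6, 8]
```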

On a similar topic: given a GAN setup with a generator, a discriminator, and two GPUs, what is the recommended way to speed up the computation, given the dependency between the discriminator and the generator?


Yes, you can split a single model across multiple GPUs in PyTorch with minimal fuss. Here is an example from @apaszke:

class MyModel(nn.Module):
    def __init__(self, split_gpus):
        super().__init__()
        self.large_submodule1 = ...
        self.large_submodule2 = ...

        self.split_gpus = split_gpus
        if self.split_gpus:
            # pin each half of the network to its own device
            self.large_submodule1.cuda(0)
            self.large_submodule2.cuda(1)

    def forward(self, x):
        x = self.large_submodule1(x)
        if self.split_gpus:
            x = x.cuda(1)  # P2P GPU transfer
        return self.large_submodule2(x)

One caveat (to the "minimal fuss") is that you will probably want to try several split points to balance GPU memory consumption across the devices! Here is a more fleshed-out example with VGG-16 :slight_smile:
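One way to think about choosing a split point, sketched in plain Python with made-up per-layer memory costs: pick the boundary that minimizes the larger of the two devices' totals.

```python
# Toy search for a two-way split point: given per-layer memory costs,
# pick the boundary that best balances the two GPUs.
def best_split(layer_sizes):
    best, best_peak = 1, float("inf")
    for k in range(1, len(layer_sizes)):
        # peak memory if layers [0:k] go to GPU 0 and [k:] to GPU 1
        peak = max(sum(layer_sizes[:k]), sum(layer_sizes[k:]))
        if peak < best_peak:
            best, best_peak = k, peak
    return best, best_peak

sizes = [4, 4, 8, 2, 6]  # hypothetical per-layer memory (GB)
k, peak = best_split(sizes)
print(k, peak)  # 2 16 -> split after layer 2, 16 GB on the busier GPU
```

In practice activation memory also matters, so measuring actual usage at a few candidate splits is still the safest approach.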


This is interesting, but it does not really run in parallel. While module 1 is running, module 2 (and its GPU) is idle, and vice versa. I was looking for a way to keep both GPUs busy at all times. :slight_smile:


nn.DataParallel is built exactly for what you want :slight_smile:


This inherently involves some idling because of the sequential nature of the forward and backward passes. This is model parallelism (as opposed to data parallelism). For example, see here
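The usual way to reduce that idling is pipelining: split each batch into micro-batches so the first GPU can start on micro-batch 2 while the second GPU is still working on micro-batch 1. A plain-Python sketch of the schedule, with threads and queues standing in for GPUs and device-to-device transfers:

```python
# Sketch of pipeline model parallelism: two stages ("GPUs") connected
# by queues. Feeding micro-batches lets the stages overlap in time.
import queue
import threading

def stage(fn, inbox, outbox):
    # One pipeline stage: consume micro-batches, process, pass along.
    while True:
        item = inbox.get()
        if item is None:      # sentinel: propagate shutdown downstream
            outbox.put(None)
            return
        outbox.put(fn(item))

def pipeline(micro_batches, stage1, stage2):
    q01, q12, out = queue.Queue(), queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=stage, args=(stage1, q01, q12)),
        threading.Thread(target=stage, args=(stage2, q12, out)),
    ]
    for w in workers:
        w.start()
    for mb in micro_batches:  # feed micro-batches; stages overlap
        q01.put(mb)
    q01.put(None)
    results = []
    while (item := out.get()) is not None:
        results.append(item)
    for w in workers:
        w.join()
    return results

# Two toy stages: stage 1 doubles, stage 2 adds one.
print(pipeline([1, 2, 3, 4], lambda x: 2 * x, lambda x: x + 1))  # [3, 5, 7, 9]
```

This is only the forward pass; real pipelined training (e.g. GPipe-style schedules) also interleaves backward passes, but the overlap idea is the same.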