Model parallelism across multiple GPUs: forward/backward graph

In model parallelism, a DNN is divided into sub-modules, and each sub-module is handled by a different GPU.

The forward graph, I assume, spans multiple GPUs.

Will the backward graph along with any internal data also span multiple GPUs?

Yes, this should be the case.
You can check it by using model sharding and then printing the device of the gradients for each submodule.
They should be located on the corresponding device.
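For example, something like this minimal sketch (toy layer sizes, assumes two visible GPUs) shows where each submodule's gradients end up after a backward pass:

import torch
import torch.nn as nn

# Minimal sketch: split a toy model across two GPUs, run one
# forward/backward pass, and print the device of each parameter's gradient.
class ToyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.module1 = nn.Linear(16, 16).to("cuda:0")
        self.module2 = nn.Linear(16, 4).to("cuda:1")

    def forward(self, x):
        x = self.module1(x)
        return self.module2(x.to("cuda:1"))

model = ToyNet()
out = model(torch.randn(8, 16, device="cuda:0"))
out.sum().backward()

for name, p in model.named_parameters():
    print(name, p.grad.device)  # module1.* -> cuda:0, module2.* -> cuda:1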

Genuine question: is that “model sharding” or model parallelism actually implemented yet? So far, I’ve only done the multi-GPU stuff with data parallelism (e.g., https://pytorch.org/tutorials/beginner/former_torchies/parallelism_tutorial.html).

Yes, you can do “model parallelism”, for instance with 2 GPUs, like this:

import torch.nn as nn

class Network(nn.Module):
    def __init__(self, split_gpus):
        super().__init__()
        self.module1 = nn.Sequential()  # some layers
        self.module2 = nn.Sequential()  # some layers

        self.split_gpus = split_gpus
        if self.split_gpus:
            self.module1.to("cuda:0")
            self.module2.to("cuda:1")

    def forward(self, x):
        x = self.module1(x)
        if self.split_gpus:
            x = x.to("cuda:1")
        return self.module2(x)

Oh I see, thanks!

Regarding the speed-up: it is probably less than for data parallelism, right? Because module 2 needs the results of module 1 before it can compute its forward pass (and vice versa for backprop).

So the advantage would be that it allows fitting large models into memory, rather than improving training speed?

Yes, exactly. At least that’s what I’ve used it for.
As far as I know, the transfer between the two GPUs is done via P2P, so no host communication is needed.
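If you want to verify this on your own machine, newer PyTorch versions expose a check for peer access (just a quick sketch):

import torch

# Reports whether direct peer-to-peer access between device 0 and device 1
# is possible on this machine.
print(torch.cuda.can_device_access_peer(0, 1))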


I implemented something like @Dhorka’s example above, and the following (simplified) code works in the forward pass, but…

from collections import OrderedDict
import torch.nn as nn

class Network(nn.Module):  # simplified
    def __init__(self, split_gpus):
        super().__init__()
        self.layerseq1 = nn.Sequential(OrderedDict())  # some layers
        self.layerseq2 = nn.Sequential(OrderedDict())  # some layers
        self.split_gpus = split_gpus
        if self.split_gpus:
            self.layerseq1.cuda(device=0)
            self.layerseq2.cuda(device=1)

    def forward(self, x):
        x = self.layerseq1(x)
        if self.split_gpus:
            x = x.cuda(device=1)
        return self.layerseq2(x)

…it runs into problems when it tries to compute the nn.MSELoss, claiming:

Assertion THCTensor_(checkGPU) [...] failed. Some of the weight/gradient/input tensors 
are on different GPUs. Please move them to a single one.
 at /opt/software/pytorch/aten/src/THCUNN/generic/MSECriterion.cu:13

While running an epoch such as:

model = Network(split_gpus=True)
criterion = nn.MSELoss()
for batch_idx, (data, target) in enumerate(some_dataset()):
    data = data.cuda()
    output = model(data)
    loss = criterion(output, data)   # <-- fails at runtime
    optimizer.zero_grad()
    ...

I thought that couldn’t possibly mean that all trained layers need to be on the same GPU, so I tried moving both arguments to the same GPU before the loss evaluation (below), but that seems to have no effect:

output.cuda(device=1)
data.cuda(device=1)

I’m still getting up to speed on pytorch, so any guidance would be appreciated…

What is wrong with wrapping the model in DataParallel, as @rasbt mentioned above, besides the memory motivation?

With DataParallel, the gradients from all GPUs are temporarily accumulated on one of the main GPUs. This can be an issue if all your GPUs already use up a lot of memory, because it can result in an out-of-memory error. So, currently it is not possible to accumulate the gradients on a GPU that is not also used during training and already holding a model replica in memory. In practice, I haven’t run into problems with this yet, but it could be nicer (although maybe a bit slower) to use a dedicated GPU for the gradient accumulation so that memory spikes can be avoided.
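To illustrate the point, here is a minimal sketch with made-up layer and batch sizes: by default DataParallel gathers the outputs on output_device, which defaults to device_ids[0], so one of the training GPUs also holds the gathered outputs and accumulated gradients.

import torch
import torch.nn as nn

# Sketch: the gather/accumulation device is device_ids[0] by default,
# which is also one of the GPUs running the replicas.
model = nn.Linear(128, 10).to("cuda:0")
dp_model = nn.DataParallel(model, device_ids=[0, 1])

x = torch.randn(64, 128, device="cuda:0")
out = dp_model(x)     # replicas run on cuda:0 and cuda:1
print(out.device)     # cuda:0 - where the outputs are gathered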

I haven’t had this issue yet either, but I see where you are coming from. It would be great to have an option to use a smaller GPU just for gradient accumulation, as a central hub.

I just realized that my answer was in a different context… Regarding your question, I thought you were referring to “Uneven GPU utilization during training backpropagation”, so my answer may be a bit out of context given the problem discussed in this thread :stuck_out_tongue:

@halahup - It’s 100% the memory motivation. Loading the full model for training exceeds the amount of VRAM on a V100, so data parallelism won’t help.

Ok, gotcha, then yes!

Could the issue be related to this? On pytorch’s CUDA Semantics page, it says: “Unless you enable peer-to-peer memory access, any attempts to launch ops on tensors spread across different devices will raise an error.”

Does running MSELoss count as “launching ops”? I’d test it by enabling it, but there doesn’t seem to be any documentation on how to enable GPU peer-to-peer memory access in PyTorch, either here or on Google, apart from one post where @smth indicates it’s on by default…

Seems like your labels and outputs are on different GPUs. Move them to the same GPU and it should work. Something like:

output = model(data)
target = target.to(output.device)  # move the target to the same device as the output
loss = F.nll_loss(output, target)
loss.backward()
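Adapted to the MSELoss setup from above (just a sketch; it assumes the model output ends up on cuda:1 because layerseq2 lives there), that would look like:

criterion = nn.MSELoss()

output = model(data)              # output lives on cuda:1
target = data.to(output.device)   # move the reconstruction target to the same GPU
loss = criterion(output, target)
loss.backward()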