How pytorch's parallel method and distributed method works?


(Erick Guan) #1

\I’m no expert in distributed system and CUDA. But there is one really interesting feature that PyTorch support which is nn.DataParallel and nn.DistributedDataParallel. How they are actually implemented? How they separate common embeddings and synchronize data?

Here is a basic example of DataParallel.

import torch.nn as nn
from torch.autograd.variable import Variable
import numpy as np

class Model(nn.Module):
    def __init__(self):
        super().__init__(
            embedding=nn.Embedding(1000, 10),
            rnn=nn.Linear(10, 10),
        )

    def forward(self, x):
        x = self.embedding(x)
        x = self.rnn(x)
        return x

model = nn.DataParallel(Model())
model.forward(Variable.from_numpy(np.array([1,2,3,4,5,6], dtype=np.int64)).cuda()).cpu()

PyTorch can split the input and send them to many GPUs and merge the results back.

How does it manage embeddings and synchronization for a parallel model or a distributed model?

I wandered around PyTorch’s code but it’s very hard to know how the fundamentals work.


I’ve posted this question on StackOverflow.


(Sebastian Raschka) #2

I am not sure about DistributedParallel but in DataParallel each GPU gets a copy of the model, so, the parallelization is done via splitting the minibatches, not the layers/weights.

Here’s a sketch of how DataParallel works, assuming 4 GPUs where GPU:0 is the default GPU.


(Landscape) #3

dear @Rasbt
I am so sorry to recognize the step5 draft words , could you pls re-type here for reply . thanks a lot


(Sebastian Raschka) #4

Haha no problem

Compute the loss with respect to the network outputs on the default GPU and scatter the losses back to the individual GPUs to compute the gradients with respect to the leaf nodes


(Harshil Jain) #5

Can every network be ported for multi gpu if the model fits on single gpu?


(Sebastian Raschka) #6

Yes, I believe so. If your model does not fit on a single GPU though, there’s even another method that you can use (instead of DataParallel). E.g.,

class Network(nn.Module):
    def __init__(self, split_gpus):
        self.module1 = (some layers)
        self.module2 = (some layers)

        self.split_gpus = split_gpus
        if self.split_gpus:
            self.module1.device("cuda:0")
            self.module2.device("cuda:1")

    def forward(self, x):
        x = self.module1(x)
        if self.split_gpus:
            x = x.device("cuda:1") 
        return self.module2(x)

(Harshil Jain) #7

@rasbt

Any views on this thread
How to parallelise a pytorch model on GPU?


(Landscape) #8

why to gather all outputs into GPU0 in step5 , it merely calculate distance of output_i with target_i ? why not calculate it within each GPU separately ?

one possible reason : all target (ground truth) are just stored in GPU0 , however , target can also be scattered as a addition of minibatch, like scatter data chunk , in step1


(Sebastian Raschka) #9

why to gather all outputs into GPU0 in step5 , it merely calculate distance of output_i with target_i ? why not calculate it within each GPU separately ?

It’s more like an implementation side-effect. In fact, you can do what you proposed, but you would have to rewrite your code then to compute the loss inside the model – usually, the loss is computed in the training loop based on the outputs of the model.

I mean, it’s not a problem at all to do what you propose, but it’s a bit less convenient, because when you decide to use DataParallel (occasionally), you would have to modify your model code as well.

Also, computing the loss is super cheap, so there is not much speed gain in distributing it across the devices, I suppose.


(Landscape) #10

sorry ,I can’t agree with this explanation , thrifting or economy the distributing cluster’s compute energy . Because transportation data is not cheaper than calculating within micro chip , in my opinion .

is there some welcomed link or chart to strengthen ?


(Sebastian Raschka) #11

sorry ,I can’t agree with this explanation , thrifting or economy the distributing cluster’s compute energy . Because transportation data is not cheaper than calculating within micro chip , in my opinion .

is there some welcomed link or chart to strengthen ?

I don’t have a chart handy, but when I ran the code on my workstation (only 4x 1080Tis), I didn’t notice any difference between letting the GPU handle the loss computation versus distributing the results, calculating the results on the individual GPUs, and then gathering them. Either way, I get the same ~3x speedup when using 4 instead of 1 GPU. This is a small hardware setup, and the scenario is probably different for large datacenters, like you said… Also, the way I implemented it,

model = MyModel(num_features=num_features, num_classes=num_classes)
if torch.cuda.device_count() > 1:
    print("Using", torch.cuda.device_count(), "GPUs")
    model = nn.DataParallel(model)

and then

for epoch in range(NUM_EPOCHS):
    
    model.train()
    for batch_idx, (features, targets) in enumerate(train_loader):
        
        features = features.to(DEVICE)
        targets = targets.to(DEVICE)
            
        ### FORWARD AND BACK PROP
        logits, probas = model(features)
        cost = cost_fn(logits, targets)
        optimizer.zero_grad()
        
        cost.backward()
        
        ### UPDATE MODEL PARAMETERS
        optimizer.step()

Is basically such that I don’t have to rewrite my model when I decide to use DataParallel, which is most convenient for my use cases since I do not always run DataParallel.


(Harshil Jain) #12

Thanks for the detailing here.
How can we put the computation part of the losses on another gpu(GPU:4) rather than GPU:0 which happens by default.

It will be great to have the method as many networks are designed such that they cover ~11 gigs and this things holds you back while parallelizing such a network.


(Erick Guan) #13

It doesn’t makes sense. If the loss aggregation is the bottleneck, what’s the benefit to compute on a GPU? The computation of addition shouldn’t get faster because of devices.


(Harshil Jain) #14

Computation of addition is not the roadblock. There is no space left for the computation on the gpu. That is the issue which I am trying to solve.

I hope it’s clear now.


(Sebastian Raschka) #15

You actually can use a different device as default device for data parallelism. However, this must be one of the GPUs that is also used by DataParallel. You can set it via the output_device parameter.

Regarding using a GPU that is not wrapped by DataParallel, that’s currently not possible (see discussion here: Uneven GPU utilization during training backpropagation)


(Erick Guan) #16

I finally did the code reading on DataParallel part.