I’m no expert in distributed systems or CUDA, but there is one really interesting feature that PyTorch supports: nn.DataParallel and nn.DistributedDataParallel. How are they actually implemented? How do they separate common embeddings and synchronize data?
Here is a basic example of DataParallel.
import torch
import torch.nn as nn
import numpy as np

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(1000, 10)
        self.rnn = nn.Linear(10, 10)

    def forward(self, x):
        x = self.embedding(x)
        x = self.rnn(x)
        return x

model = nn.DataParallel(Model().cuda())
model(torch.from_numpy(np.array([1, 2, 3, 4, 5, 6], dtype=np.int64)).cuda()).cpu()
PyTorch can split the input, send the chunks to multiple GPUs, and merge the results back.
How does it manage embeddings and synchronization for a parallel model or a distributed model?
I wandered around PyTorch’s code but it’s very hard to know how the fundamentals work.
I am not sure about DistributedDataParallel, but in DataParallel each GPU gets a copy of the model, so the parallelization is done by splitting the minibatches, not the layers/weights.
Here’s a sketch of how DataParallel works, assuming 4 GPUs where GPU:0 is the default GPU:
1. Split the mini-batch into chunks and scatter one chunk to each GPU.
2. Copy (replicate) the model to each GPU.
3. Run the forward pass on each GPU for its chunk of the mini-batch.
4. Gather the outputs from all GPUs onto the default GPU.
5. Compute the loss with respect to the network outputs on the default GPU and scatter the losses back to the individual GPUs to compute the gradients with respect to the leaf nodes.
6. Reduce the gradients on the default GPU and update the model parameters.
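For intuition, these building blocks are exposed as functions in torch.nn.parallel (scatter, replicate, parallel_apply, gather). Below is a minimal sketch of one forward pass, not the actual DataParallel source; it assumes model already lives on GPU:0 and batch is a single input tensor:

from torch.nn.parallel import scatter, replicate, parallel_apply, gather

device_ids = [0, 1, 2, 3]
output_device = 0

inputs = scatter(batch, device_ids)         # step 1: split the mini-batch across the GPUs
replicas = replicate(model, device_ids)     # step 2: copy the model to each GPU
outputs = parallel_apply(replicas, inputs)  # step 3: run the forward passes in parallel
result = gather(outputs, output_device)     # step 4: collect all outputs on the default GPU
# steps 5/6: the loss, backward pass, and update then proceed from `result` on GPU:0 as usual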
Why gather all the outputs onto GPU:0 for step 5, when it merely computes the distance between output_i and target_i? Why not compute it within each GPU separately?
One possible reason: all the targets (ground truth) are stored only on GPU:0. However, the targets could also be scattered along with the mini-batch, just like the data chunks in step 1.
It’s more like an implementation side-effect. In fact, you can do what you proposed, but you would have to rewrite your code then to compute the loss inside the model – usually, the loss is computed in the training loop based on the outputs of the model.
I mean, it’s not a problem at all to do what you propose, but it’s a bit less convenient, because when you decide to use DataParallel (occasionally), you would have to modify your model code as well.
Also, computing the loss is super cheap, so there is not much speed gain in distributing it across the devices, I suppose.
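To make that concrete, here is a minimal sketch of the proposed alternative, computing the loss inside the model so that each GPU handles its own chunk; ModelWithLoss and base_model are made-up names for this illustration:

import torch.nn as nn

class ModelWithLoss(nn.Module):
    # Hypothetical wrapper: the loss is computed inside forward(), so under
    # nn.DataParallel each GPU computes the loss for its own mini-batch chunk.
    def __init__(self, base_model):
        super().__init__()
        self.base_model = base_model
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, features, targets):
        logits = self.base_model(features)
        return self.criterion(logits, targets)

# Usage sketch: DataParallel scatters both features and targets, each GPU
# returns its own scalar loss, and the gathered losses are averaged on GPU:0.
# parallel_model = nn.DataParallel(ModelWithLoss(model).cuda())
# loss = parallel_model(features, targets).mean()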
Sorry, I can’t agree with this explanation about saving the distributed cluster’s compute. In my opinion, moving data around is not cheaper than computing it within the chip.
Is there a link or chart that backs this up?
I don’t have a chart handy, but when I ran the code on my workstation (only 4x 1080 Tis), I didn’t notice any difference between letting the default GPU handle the loss computation versus computing the losses on the individual GPUs and then gathering them. Either way, I get the same ~3x speedup when using 4 GPUs instead of 1. This is a small hardware setup, and the scenario is probably different for large datacenters, like you said… Also, the way I implemented it,
model = MyModel(num_features=num_features, num_classes=num_classes)

if torch.cuda.device_count() > 1:
    print("Using", torch.cuda.device_count(), "GPUs")
    model = nn.DataParallel(model)
and then
for epoch in range(NUM_EPOCHS):
    model.train()
    for batch_idx, (features, targets) in enumerate(train_loader):
        features = features.to(DEVICE)
        targets = targets.to(DEVICE)

        ### FORWARD AND BACK PROP
        logits, probas = model(features)
        cost = cost_fn(logits, targets)
        optimizer.zero_grad()
        cost.backward()

        ### UPDATE MODEL PARAMETERS
        optimizer.step()
is basically such that I don’t have to rewrite my model when I decide to use DataParallel, which is most convenient for my use cases since I do not always run with DataParallel.
Thanks for the detailed explanation here.
How can we put the loss computation on another GPU (GPU:4) rather than GPU:0, which is used by default?
It would be great to have such a method, since many networks are designed to fill up ~11 GB of memory, and this holds you back when parallelizing such a network.
That doesn’t make sense to me. If the loss aggregation is the bottleneck, what’s the benefit of computing it on a different GPU? The addition shouldn’t get faster just because it runs on another device.
You actually can use a different device as the default device for data parallelism. However, it must be one of the GPUs that DataParallel also uses. You can set it via the output_device parameter.
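For example, a sketch with made-up device indices:

# Gather the outputs (and hence compute the loss) on cuda:1 instead of cuda:0.
model = nn.DataParallel(model, device_ids=[0, 1, 2, 3], output_device=1)
# The targets then need to live on the same device as the gathered outputs:
# targets = targets.to('cuda:1')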
As in the steps you posted earlier, step 5 says that the loss is computed with respect to the network outputs on the default GPU. My question is: are the losses computed for each replica eventually averaged or summed? I am not sure what exactly my loss function is printing.