I’m no expert in distributed systems or CUDA, but there is one really interesting feature that PyTorch supports: nn.DistributedDataParallel. How is it actually implemented? How does it handle the shared embeddings and synchronize data between devices?
Here is a basic example using nn.DataParallel:
```python
import numpy as np
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(1000, 10)
        self.rnn = nn.Linear(10, 10)

    def forward(self, x):
        x = self.embedding(x)
        x = self.rnn(x)
        return x

model = nn.DataParallel(Model()).cuda()
output = model(torch.from_numpy(np.array([1, 2, 3, 4, 5, 6], dtype=np.int64)).cuda()).cpu()
```
PyTorch splits the input along the batch dimension, sends each chunk to a different GPU, runs the model on each one, and merges the per-GPU results back together.
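As far as I can tell, that forward pass can be expressed with the public helpers in torch.nn.parallel. Here is a rough sketch (the function name data_parallel_forward is mine; I’m assuming the module already sits on the first listed device, and I’m ignoring kwargs and other bookkeeping the real implementation does):

```python
from torch.nn.parallel import scatter, replicate, parallel_apply, gather

# A sketch of a DataParallel-style forward pass, pieced together from the
# public helpers in torch.nn.parallel. Assumption: module already lives
# on device_ids[0].
def data_parallel_forward(module, batch, device_ids, output_device=0):
    inputs = scatter(batch, device_ids)                      # split batch along dim 0
    replicas = replicate(module, device_ids[:len(inputs)])   # copy parameters
                                                             # (embedding weights included)
    outputs = parallel_apply(replicas, inputs)               # one forward pass per GPU
    return gather(outputs, output_device)                    # concatenate on one device
```

So during the forward pass every GPU seems to hold a full copy of the embedding table; nothing is sharded.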
But how does it manage the embeddings and handle synchronization, both for a parallel model and for a distributed one?
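From what I could piece together, the embedding table is treated like any other parameter: every worker holds a full copy, and it is the gradients that get synchronized so the copies never drift apart. A minimal sketch of that idea for the distributed case, using torch.distributed.all_reduce (sync_gradients is a name I made up; this assumes torch.distributed is already initialized with one process per GPU):

```python
import torch.distributed as dist

# A minimal sketch of the synchronization idea: after backward, sum every
# parameter's gradient across processes, then average. The real
# nn.DistributedDataParallel buckets gradients and overlaps this
# all-reduce with the backward pass; this sketch does neither.
def sync_gradients(model, world_size):
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # sum across ranks
            param.grad /= world_size                           # average the sum
```

Averaging the summed gradients means every process applies an identical update, which would be what keeps the replicated embeddings in sync.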
I wandered around PyTorch’s source code, but it’s very hard to see how the fundamentals work from the code alone.
I’ve posted this question on StackOverflow.