I’m no expert in distributed systems or CUDA, but there is one really interesting feature that PyTorch supports: `nn.DataParallel` and `nn.DistributedDataParallel`. How are they actually implemented? How do they separate common embeddings and synchronize data?

Here is a basic example of `DataParallel`.

```
import numpy as np
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(1000, 10)
        self.rnn = nn.Linear(10, 10)

    def forward(self, x):
        x = self.embedding(x)
        x = self.rnn(x)
        return x

# Wrap the model so the input batch is split across the available GPUs.
model = nn.DataParallel(Model().cuda())
inputs = torch.from_numpy(np.array([1, 2, 3, 4, 5, 6], dtype=np.int64)).cuda()
outputs = model(inputs).cpu()
```

PyTorch splits the input along the batch dimension, sends each chunk to a different GPU, runs the forward pass in parallel, and merges the results back onto a single device.
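As far as I can tell, the flow is replicate → scatter → apply → gather, built on the primitives shown in PyTorch’s multi-GPU tutorial. Here is a rough sketch (`data_parallel_forward` is just an illustrative name, not the real method):

```
import torch.nn as nn

def data_parallel_forward(module, inputs, device_ids, output_device=None):
    # Hypothetical helper mirroring what nn.DataParallel.forward does.
    if output_device is None:
        output_device = device_ids[0]
    # 1. Copy the module's parameters and buffers onto every GPU.
    replicas = nn.parallel.replicate(module, device_ids)
    # 2. Split the input tensor along dim 0, one chunk per GPU.
    scattered = nn.parallel.scatter(inputs, device_ids)
    # 3. Run each replica on its own chunk in parallel (one thread per device).
    outputs = nn.parallel.parallel_apply(replicas[:len(scattered)], scattered)
    # 4. Concatenate the per-GPU results back on the output device.
    return nn.parallel.gather(outputs, output_device)
```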

How does it manage embeddings and synchronization for a parallel model or a distributed model?
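For reference, `DistributedDataParallel` is typically used like this. This is a minimal sketch, assuming a single node launched with `torchrun` (one process per GPU, so the global rank doubles as the GPU index) and the NCCL backend; `Model` is the class defined above.

```
import torch
import torch.distributed as dist
import torch.nn as nn

# Assumes torchrun has set RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT.
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank()
torch.cuda.set_device(local_rank)

# Every process holds a full replica of the model on its own GPU;
# gradients are all-reduced across processes during backward(),
# which keeps the replicas in sync.
ddp_model = nn.parallel.DistributedDataParallel(
    Model().cuda(local_rank), device_ids=[local_rank]
)
```

I would like to understand what happens under the hood of these wrappers, not just how to call them.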

I have wandered around PyTorch’s codebase, but it is very hard to figure out how the fundamentals work.

I’ve posted this question on Stack Overflow as well.