Suppose that I have the following DDP model:
```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class Foo(nn.Module):  # this is a demo, with dims only provided for context
    def __init__(self):
        super().__init__()  # initialize the parent class
        self.block_a = nn.Linear(256, 32)
        self.block_b = nn.Sequential(
            nn.Linear(256, 128),
            nn.Linear(128, 64),
            nn.Linear(64, 32),
        )

    def forward(self, x):
        x_a = self.block_a(x)
        x_b = self.block_b(x)
        return x_a + x_b

# Assuming rank is defined and initialized elsewhere in your distributed setup
rank = 0  # replace with the actual rank if needed
foo = Foo().to(rank)  # move the model to its device before wrapping in DDP
ddp_foo = DDP(foo, device_ids=[rank])
```
I have the following questions:
- Is there a way to make `block_a` and `block_b` execute concurrently within each DDP process? That is, on each rank, the model's computation would also be parallelized across the `x_a` and `x_b` branches.
- Since `block_b` is larger than `block_a`, I am considering running `block_b` (which has three layers) with FSDP while keeping `block_a` in DDP. Is this doable, and how would I approach it?
- This is not directly related to the model shown, but is it possible to mix data parallelism and tensor parallelism? If so, are there any intuitive examples?
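To make the first question concrete, here is a sketch of the kind of branch-level concurrency I have in mind, using CUDA streams inside `forward`. The stream handling is my own guess at an approach, and I have not verified that it interacts correctly with DDP's gradient bucketing:

```python
import torch
import torch.nn as nn

class Foo(nn.Module):
    def __init__(self):
        super().__init__()
        self.block_a = nn.Linear(256, 32)
        self.block_b = nn.Sequential(
            nn.Linear(256, 128),
            nn.Linear(128, 64),
            nn.Linear(64, 32),
        )

    def forward(self, x):
        if x.is_cuda:
            # Hypothetical: launch the two independent branches on separate
            # CUDA streams so the GPU can overlap their kernels.
            s_a, s_b = torch.cuda.Stream(), torch.cuda.Stream()
            # Each side stream must wait until x is ready on the default stream.
            s_a.wait_stream(torch.cuda.current_stream())
            s_b.wait_stream(torch.cuda.current_stream())
            with torch.cuda.stream(s_a):
                x_a = self.block_a(x)
            with torch.cuda.stream(s_b):
                x_b = self.block_b(x)
            # The default stream must wait for both branches before the add.
            torch.cuda.current_stream().wait_stream(s_a)
            torch.cuda.current_stream().wait_stream(s_b)
        else:
            # On CPU the branches simply run one after the other.
            x_a = self.block_a(x)
            x_b = self.block_b(x)
        return x_a + x_b
```

On CPU (or a single stream) this is equivalent to the original `forward`; the question is whether the streamed version is safe and actually overlaps work under DDP.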
Thanks.