import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


class Foo(nn.Module):  # this is a demo, with dims only provided for context
    def __init__(self):
        super().__init__()  # initialize the parent class
        self.block_a = nn.Linear(256, 32)
        self.block_b = nn.Sequential(
            nn.Linear(256, 128),
            nn.Linear(128, 64),
            nn.Linear(64, 32),
        )

    def forward(self, x):
        x_a = self.block_a(x)
        x_b = self.block_b(x)
        return x_a + x_b


# Assuming init_process_group has been called and rank is defined elsewhere in your distributed setup
rank = 0  # replace with the actual rank of this process
foo = Foo().to(rank)  # move the model to this rank's GPU before wrapping in DDP
ddp_foo = DDP(foo, device_ids=[rank])
I have the following questions:
Is there a way to execute block_a and block_b concurrently within each DDP process? That way, on each rank, the model computation itself would also be parallelized across the x_a and x_b branches.
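To make the intent concrete, here is a rough, untested sketch of the kind of per-rank overlap I have in mind, using a second CUDA stream for block_b. The stream handling is my own assumption, not something I have verified:

import torch
import torch.nn as nn


class FooOverlap(nn.Module):  # hypothetical variant of Foo, just to illustrate the idea
    def __init__(self):
        super().__init__()
        self.block_a = nn.Linear(256, 32)
        self.block_b = nn.Sequential(
            nn.Linear(256, 128),
            nn.Linear(128, 64),
            nn.Linear(64, 32),
        )
        self.side_stream = torch.cuda.Stream()  # extra stream for block_b

    def forward(self, x):
        main_stream = torch.cuda.current_stream()
        self.side_stream.wait_stream(main_stream)  # x must be ready before block_b reads it
        with torch.cuda.stream(self.side_stream):
            x_b = self.block_b(x)                  # block_b enqueued on the side stream
        x_a = self.block_a(x)                      # block_a enqueued on the main stream
        main_stream.wait_stream(self.side_stream)  # join the streams before combining
        # (caching-allocator / record_stream caveats are ignored in this sketch)
        return x_a + x_b

Whether this actually overlaps anything presumably depends on kernel sizes, so I am not sure it is the right approach.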
Since block_b is larger than block_a, I am considering running block_b (which has three layers) with FSDP while keeping block_a under DDP. Is this doable, and how would I approach it?
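What I had in mind is roughly the following, reusing Foo, rank, and the DDP import from the snippet above. This is only a sketch of the idea; I do not know whether this DDP/FSDP composition is actually supported, which is part of my question:

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

foo = Foo().to(rank)
foo.block_b = FSDP(foo.block_b)        # shard only the larger branch's parameters
ddp_foo = DDP(foo, device_ids=[rank])  # keep block_a replicated via DDP
# unclear to me whether wrapping an FSDP submodule inside DDP like this works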
This is not directly related to the model shown, but I am wondering whether it is possible to mix data parallelism and tensor parallelism. If so, are there any intuitive examples?
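For example, is a 2D layout with a device mesh (data parallelism on one axis, tensor parallelism on the other) the right mental model? Below is my rough sketch of what I mean, based on the torch.distributed.device_mesh and torch.distributed.tensor.parallel APIs; I have not verified the details, so treat it as a guess:

from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel, parallelize_module

# 8 GPUs arranged as a 2 (data-parallel) x 4 (tensor-parallel) mesh
mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))

model = Foo().cuda()

# shard block_b's first two linears across the "tp" dimension
# (block_b.2 and block_a stay replicated in this sketch)
model = parallelize_module(
    model,
    mesh_2d["tp"],
    {
        "block_b.0": ColwiseParallel(),
        "block_b.1": RowwiseParallel(),
    },
)

# a data-parallel wrapper (DDP or FSDP) would then be applied over the "dp"
# dimension, using the process group / sub-mesh corresponding to mesh_2d["dp"]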
With data parallelism alone (DDP or FSDP), every rank executes the whole model computation: DDP replicates the parameters on each device, while FSDP shards them but gathers them as needed. It is the data that is split across devices and processed simultaneously.
In other words, do you really need to process the same data with the computation itself parallelized, or can you shard your data so that DDP alone solves your problem?
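For reference, here is a minimal sketch of what I mean by sharding the data, reusing your Foo module (the dataset, sizes, and loss are made up, and the optimizer step is omitted):

import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dist.init_process_group("nccl")  # assumes a torchrun-style launch, single node
rank = dist.get_rank()
torch.cuda.set_device(rank)

model = DDP(Foo().to(rank), device_ids=[rank])

dataset = TensorDataset(torch.randn(1024, 256), torch.randn(1024, 32))
sampler = DistributedSampler(dataset)  # gives each rank a disjoint shard of the data
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for inputs, targets in loader:
    out = model(inputs.to(rank))       # every rank runs the full model on its own shard
    loss = F.mse_loss(out, targets.to(rank))
    loss.backward()                    # DDP all-reduces the gradients across ranks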
I suppose sharding my data does not really solve the issue. On further examination, I think this is more a problem of applying tensor parallelism to block_a and block_b concurrently.
Let me re-frame the problem. Suppose that I have 8 GPUs, with 1 process launched per GPU. I want:
process ranks 0-1 to run tensor parallelism on block_a,
process ranks 2-7 to run tensor parallelism on block_b,
block_a and block_b to be computed concurrently to maximize overlap.
How would this be done?
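To make the layout concrete, I imagine it would start with two process groups along these lines. How each block actually gets sharded within its group, and how x_a and x_b are exchanged and summed across the two groups, is exactly the part I am unsure about:

import torch.distributed as dist

# all 8 ranks must call new_group for both groups, even non-members
group_a = dist.new_group(ranks=[0, 1])              # tensor parallelism for block_a
group_b = dist.new_group(ranks=[2, 3, 4, 5, 6, 7])  # tensor parallelism for block_b

rank = dist.get_rank()
if rank in (0, 1):
    # this rank would hold a shard of block_a and join group_a collectives
    my_group = group_a
else:
    # this rank would hold a shard of block_b and join group_b collectives
    my_group = group_b

# the two groups then need to exchange and combine x_a and x_b (e.g. via a
# broadcast or all_gather across groups), which I do not know how to structure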