Parallelizing Two Concurrent Blocks By Combining Different Parallelization Strategies

Suppose that I have the following DDP model:

import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class Foo(nn.Module):  # this is a demo, with dims only provided for context
    def __init__(self):
        super().__init__()  # Initialize the parent class
        self.block_a = nn.Linear(256, 32)
        self.block_b = nn.Sequential(
            nn.Linear(256, 128),
            nn.Linear(128, 64),
            nn.Linear(64, 32)
        )

    def forward(self, x):
        x_a = self.block_a(x)
        x_b = self.block_b(x)
        return x_a + x_b

# Assuming the process group and rank are initialized elsewhere in your distributed setup
rank = 0  # replace with the actual rank of this process
foo = Foo().to(rank)  # DDP expects the module to already be on the target device
ddp_foo = DDP(foo, device_ids=[rank])

I have the following questions:

  1. Is there a way to have block_a and block_b executed concurrently in each DDP process? That way, for each rank, the model computation would also be parallelized across x_a and x_b.
  2. Since block_b is larger than block_a, I am considering running block_b (which has three layers) with FSDP while keeping block_a under DDP. Is this doable, and how would I approach it?
  3. This is not directly related to the model shown, but I am wondering whether it is possible to mix data parallelism and tensor parallelism. If so, are there any intuitive examples?

Thanks.

Using data parallelism alone (DDP, or FSDP, which additionally shards the parameters), your model executes the whole computation on every device. It is the data that is sharded across devices and processed simultaneously.

In other words, do you really need to process the same data with the computation itself parallelized, or can you shard your data so that DDP alone solves your problem?
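For reference, a minimal sketch of what I mean by sharding the data: with plain DDP, a DistributedSampler gives each rank its own slice of the dataset while the model stays replicated (the dummy dataset, batch size, and epoch count below are placeholders):

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.randn(1024, 256))      # dummy data matching Foo's 256-dim input
sampler = DistributedSampler(dataset)                # splits indices by rank / world_size
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(10):                              # placeholder epoch count
    sampler.set_epoch(epoch)                         # reshuffle the per-rank shards each epoch
    for (batch,) in loader:
        out = ddp_foo(batch.to(rank))                # each rank runs the full model on its shard
        # ... loss, backward, optimizer step ...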

I suppose sharding my data does not really solve the issue. On further examination, I think this is more a problem of running tensor parallelism on block_a and block_b concurrently.

Let me re-frame the problem. Suppose that I have 8 GPUs, with 1 process launched per GPU. I want:

  • process ranks 0-1 to run tensor parallelism on block_a,
  • process ranks 2-7 to run tensor parallelism on block_b,
  • block_a and block_b to be computed concurrently to maximize overlap.

How would this be done? (A rough sketch of the process-group layout I have in mind follows below.)
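For concreteness, this is roughly how I imagine the process groups would be created (a sketch only, assuming torch.distributed is already initialized with world_size=8, e.g. via torchrun --nproc_per_node=8):

import torch.distributed as dist

# Every rank must call new_group() for both groups, in the same order.
group_a = dist.new_group(ranks=[0, 1])              # tensor-parallel group for block_a
group_b = dist.new_group(ranks=[2, 3, 4, 5, 6, 7])  # tensor-parallel group for block_b

rank = dist.get_rank()
my_group = group_a if rank in (0, 1) else group_b   # group used for this rank's collectives

What I am unsure about is how to overlap the two blocks' execution and combine their outputs correctly.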

So what you’d like to achieve is a special form of model parallelism, not data parallelism. AFAIK PyTorch doesn’t have high-level APIs for such use cases.

I believe you can write custom autograd functions to explicitly insert the correct communication collectives to achieve it, e.g.:

# inside Foo.forward(); assumes `import torch.distributed as dist` at module level
if dist.get_rank() in (0, 1):
    x_a = self.block_a(x)
    x_b = torch.zeros(x.shape[0], 32, device=x.device)
else:
    x_a = torch.zeros(x.shape[0], 32, device=x.device)
    x_b = self.block_b(x)

# run all-reduce on x_a and x_b so every rank ends up with both results
# (assumes each contributing rank holds a partial output, so summing reconstructs the full tensor;
# plain dist.all_reduce is not tracked by autograd, hence the custom function sketched below)
dist.all_reduce(x_a)
dist.all_reduce(x_b)

return x_a + x_b
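For the backward, here is a minimal sketch of such a custom autograd function (my own illustration, not an official API): an all-reduce (SUM) in the forward pass and a pass-through backward, which is the usual choice when every rank runs the same downstream computation on the reduced result. Other layouts may need a different backward.

import torch
import torch.distributed as dist

class SumAcrossRanks(torch.autograd.Function):
    # All-reduce (SUM) in forward, identity in backward.

    @staticmethod
    def forward(ctx, tensor):
        tensor = tensor.clone()                        # avoid mutating the graph input in place
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        return tensor

    @staticmethod
    def backward(ctx, grad_output):
        # identity: every rank already holds the full gradient of the
        # replicated downstream loss w.r.t. the reduced tensor
        return grad_output

Usage inside forward() would then be something like out = SumAcrossRanks.apply(x_a + x_b), which also reduces once instead of twice.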

Besides, using PyTorch DTensor would help you avoid writing backward functions on your own.
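For example, the tensor-parallel API built on top of DTensor can shard block_b's linear layers and insert the collectives (and their backward) for you. A rough sketch, assuming a recent PyTorch and an already-initialized 8-rank job; note that this shards block_b across the whole 8-GPU mesh rather than only ranks 2-7, which would require a sub-mesh:

from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

mesh = init_device_mesh("cuda", (8,))   # one mesh dimension spanning all 8 GPUs
foo = Foo().cuda()

# Megatron-style pairing: shard layer "0" column-wise and layer "1" row-wise so the
# intermediate activation stays sharded and only layer "1"'s output is all-reduced;
# layer "2" is left replicated.
parallelize_module(
    foo.block_b,
    mesh,
    {"0": ColwiseParallel(), "1": RowwiseParallel()},
)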