Parallelizing Two Concurrent Blocks By Combining Different Parallelization Strategies

Suppose that I have the following DDP model:

import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class Foo(nn.Module):  # this is a demo, with dims only provided for context
    def __init__(self):
        super().__init__()  # Initialize the parent class
        self.block_a = nn.Linear(256, 32)
        self.block_b = nn.Sequential(
            nn.Linear(256, 128),
            nn.Linear(128, 64),
            nn.Linear(64, 32)
        )

    def forward(self, x):
        x_a = self.block_a(x)
        x_b = self.block_b(x)
        return x_a + x_b

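# Assumes torch.distributed.init_process_group(...) has already been called
# and that rank is the GPU index used by this process
rank = 0  # replace with the actual rank
foo = Foo().to(rank)  # DDP expects the module to already be on its device
ddp_foo = DDP(foo, device_ids=[rank])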

I have the following questions:

  1. Is there a way to execute block_a and block_b concurrently within each DDP process, so that on each rank the computation is also parallelized across x_a and x_b?
  2. Since block_b is larger than block_a, I am considering running block_b (which has three layers) with FSDP while keeping block_a under DDP. Is this doable, and if so, how would I approach it?
  3. This is not directly related to the model shown, but is it possible to combine data parallelism and tensor parallelism? If so, are there any intuitive examples?
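
To make question 1 more concrete, here is a rough sketch of the kind of thing I have in mind: running block_b on a side CUDA stream while block_a runs on the current stream. The class name FooStreams is made up for illustration, the use of CUDA streams is my own assumption about how this might be done, and I have not verified that it gives any real overlap or that it behaves well once wrapped in DDP. On CPU it simply falls back to sequential execution.

```python
import torch
import torch.nn as nn

class FooStreams(nn.Module):  # hypothetical variant of Foo, same layers
    def __init__(self):
        super().__init__()
        self.block_a = nn.Linear(256, 32)
        self.block_b = nn.Sequential(
            nn.Linear(256, 128),
            nn.Linear(128, 64),
            nn.Linear(64, 32),
        )

    def forward(self, x):
        if not x.is_cuda:
            # No CUDA streams on CPU: run the two blocks one after the other
            return self.block_a(x) + self.block_b(x)

        current = torch.cuda.current_stream(x.device)
        side = torch.cuda.Stream(device=x.device)

        # The side stream must not start before x has been produced
        side.wait_stream(current)
        with torch.cuda.stream(side):
            x_b = self.block_b(x)   # enqueued on the side stream
        x_a = self.block_a(x)       # enqueued on the current stream

        # Rejoin: the current stream waits for block_b before the add
        current.wait_stream(side)

        # Inform the caching allocator about cross-stream tensor usage
        x.record_stream(side)
        x_b.record_stream(current)
        return x_a + x_b
```

Whether the two blocks actually overlap would presumably depend on whether a single kernel already saturates the GPU, so for layers this small the gain may be negligible.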

Thanks.