Why do most of the DDP examples not use the main process?

I was looking at some of the DDP examples and noticed that most of them spawn 8 processes but then never use the main process again, e.g. in Getting Started with Distributed Data Parallel — PyTorch Tutorials 1.7.1 documentation:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP

# setup(), cleanup(), ToyMpModel, run_demo(), demo_basic and
# demo_checkpoint are all defined earlier in the tutorial.
def demo_model_parallel(rank, world_size):
    print(f"Running DDP with model parallel example on rank {rank}.")
    setup(rank, world_size)

    # setup mp_model and devices for this process
    dev0 = rank * 2
    dev1 = rank * 2 + 1
    mp_model = ToyMpModel(dev0, dev1)
    ddp_mp_model = DDP(mp_model)

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_mp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    # outputs will be on dev1
    outputs = ddp_mp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(dev1)
    loss_fn(outputs, labels).backward()
    optimizer.step()

    cleanup()


if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    if n_gpus < 8:
        print(f"Requires at least 8 GPUs to run, but got {n_gpus}.")
    else:
        run_demo(demo_basic, 8)
        run_demo(demo_checkpoint, 8)
        run_demo(demo_model_parallel, 4)
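
For context, run_demo in that version of the tutorial is just a thin wrapper around torch.multiprocessing.spawn, roughly:

import torch.multiprocessing as mp

def run_demo(demo_fn, world_size):
    # Launches world_size fresh worker processes, each running
    # demo_fn(rank, world_size) with rank = 0 .. world_size - 1.
    # join=True makes the main process block here until every
    # worker exits, so it never takes part in the training itself.
    mp.spawn(demo_fn,
             args=(world_size,),
             nprocs=world_size,
             join=True)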

Why isn’t the main process used as one of the ranks? Isn’t this wasting an entire process?
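
What I had in mind is something like the hypothetical helper below (run_demo_using_main is a name I made up; it assumes demo_fn and the tutorial's setup/cleanup are unchanged), where the main process takes rank 0 itself instead of just waiting:

import torch.multiprocessing as mp

def run_demo_using_main(demo_fn, world_size):
    # The "spawn" start method is needed if workers use CUDA; the
    # default fork on Linux can break CUDA in child processes.
    mp.set_start_method("spawn", force=True)
    # Start only world_size - 1 children, for ranks 1 .. world_size - 1 ...
    workers = [mp.Process(target=demo_fn, args=(rank, world_size))
               for rank in range(1, world_size)]
    for w in workers:
        w.start()
    # ... and let the main process participate as rank 0 (the master).
    demo_fn(0, world_size)
    for w in workers:
        w.join()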


Related: python - Why does my pytorch rpc workers deadlock, is it because I am using main as my master? - Stack Overflow