Simple "Hello World" to run on multiple PCs in local network

I have a home local network with a Windows Intel-powered PC and a MacBook with M2.

I want to test distributed training on CPU and train the network on both computers. However, I don’t understand how to run it.

There are multiple tutorials, each with slightly different information.
For example, some use torch.distributed.launch, which complains that it should be replaced by torchrun.
Also, it seems that all of them use one or more GPUs, and none is CPU-only.

I have read "Getting Started with Distributed Data Parallel" (PyTorch Tutorials 2.0.1+cu117 documentation) and copied the code from "Basic Use Case" with ToyModel:

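import argparse
import os

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP

class ToyModel(nn.Module):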
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))

Based on this example from the pytorch/examples repository on GitHub, I have used:

# Modified to use only the CPU
def demo_basic(local_world_size, local_rank):
    print(
        f"[{os.getpid()}] rank = {dist.get_rank()}, "
        + f"world_size = {dist.get_world_size()}, local_world_size={local_world_size}, local_rank={local_rank}"
    )

    model = ToyModel().to('cpu')
    ddp_model = DDP(model)  # no device_ids for a CPU-only module

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    # One training step on random data
    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to('cpu')
    loss_fn(outputs, labels).backward()
    optimizer.step()


def spmd_main(local_world_size, local_rank):
    # These are the parameters used to initialize the process group
    env_dict = {
        key: os.environ[key]
        for key in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")
    }
    print(f"[{os.getpid()}] Initializing process group with: {env_dict}")
    dist.init_process_group(backend="nccl")
    print(
        f"[{os.getpid()}] world_size = {dist.get_world_size()}, "
        + f"rank = {dist.get_rank()}, backend={dist.get_backend()}"
    )

    demo_basic(local_world_size, local_rank)

    # Tear down the process group
    dist.destroy_process_group()

And in main I have:

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    parser.add_argument("--local_world_size", type=int, default=1)
    args = parser.parse_args()
    spmd_main(args.local_world_size, args.local_rank)

However, I don’t know how to run this code so that it spawns on both my host PC and the Mac. I am not very familiar with networking and have only very basic knowledge.

A few suggestions:

I have a home local network with a Windows Intel-powered PC and a MacBook with M2

If you are using multiple hosts, you will need to find the private IP address of one of the computers and make sure that the other host can reach it (use ping). Then set MASTER_ADDR to that IP address and MASTER_PORT to an available port on that machine, using the same two values on both machines.
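As a rough illustration of the variables involved, here is a minimal sketch that builds the four values the script above reads from the environment. The helper names rendezvous_env and can_connect, the address 192.168.1.10 and the port 29500 are placeholders invented for this example, not values from the thread. Note that the TCP check can only succeed once something is already listening on that port, which for the master port is assumed to mean that rank 0 has started initializing the process group.

```python
import os
import socket


def rendezvous_env(master_addr, master_port, rank, world_size):
    """Build the four variables read by the script (hypothetical helper)."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
    }


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Placeholder values: replace with the real private IP and a free port.
    # Use rank=0 on the machine that owns MASTER_ADDR and rank=1 on the other.
    env = rendezvous_env("192.168.1.10", 29500, rank=0, world_size=2)
    os.environ.update(env)
    print(env)
```

Environment variables must be strings, which is why the port, rank and world size are converted before being placed in os.environ.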


The call that initializes the process group should be changed to dist.init_process_group(backend="gloo"), as nccl only works with CUDA tensors.
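To check that the gloo backend works on a CPU-only machine before involving a second host, something like the following smoke test could be used. It is only a sketch: it assumes a single process (world_size=1) on the loopback address, and the helper names find_free_port and cpu_smoke_test are made up for this example.

```python
import socket

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def find_free_port():
    # Ask the OS for an unused TCP port on the loopback interface.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def cpu_smoke_test():
    # Single-process group (world_size=1) on the gloo backend, CPU only.
    dist.init_process_group(
        backend="gloo",
        init_method=f"tcp://127.0.0.1:{find_free_port()}",
        rank=0,
        world_size=1,
    )
    try:
        model = DDP(nn.Linear(10, 5))  # no device_ids for a CPU module
        loss = nn.MSELoss()(model(torch.randn(20, 10)), torch.randn(20, 5))
        loss.backward()  # DDP synchronizes gradients across ranks here
        return dist.get_backend(), dist.get_world_size(), loss.item()
    finally:
        # Always tear down the process group, even if the step fails.
        dist.destroy_process_group()


if __name__ == "__main__":
    print(cpu_smoke_test())
```

If this runs cleanly on each machine on its own, any remaining problems are more likely to be in the network setup than in the PyTorch installation.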

However, I don’t know how to run this code so that it spawns on both my host PC and the Mac.

If the code is all in one file, then you would just need to run it with python on each machine, provided that all the dependencies (e.g. torch) are installed on both machines.

All that being said, the simpler way to try out distributed training, especially when using CPUs, is to use just one host (one machine) and create multiple processes.