I have a home local network with a Windows PC (Intel CPU) and a MacBook with an M2 chip.
I want to test distributed training on CPU and train the network on both computers. However, I don’t understand how to run it.
There are multiple tutorials, each with slightly different information.
For example, some use torch.distributed.launch, which prints a warning that it is deprecated and should be replaced by torchrun.
Also, it seems that all of them use one or more GPUs; none of them is CPU-only.
I have read "Getting Started with Distributed Data Parallel" (PyTorch Tutorials 2.0.1+cu117 documentation) and copied the ToyModel code from its "Basic Use Case" section:
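import argparse
import os
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)
    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))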
Based on this README (examples/README.md at main · pytorch/examples · GitHub), I have used:
# modified to use only CPU
def demo_basic(local_world_size, local_rank):
    print(
        f"[{os.getpid()}] rank = {dist.get_rank()}, "
        + f"world_size = {dist.get_world_size()}, local_world_size={local_world_size}, local_rank={local_rank}"
    )
    model = ToyModel().to('cpu')
    ddp_model = DDP(model)
    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)
    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to('cpu')
    loss_fn(outputs, labels).backward()
    optimizer.step()
def spmd_main(local_world_size, local_rank):
    # These are the parameters used to initialize the process group
    env_dict = {
        key: os.environ[key]
        for key in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")
    }
    print(f"[{os.getpid()}] Initializing process group with: {env_dict}")
    # "gloo" is the CPU backend; "nccl" only works with NVIDIA GPUs
    dist.init_process_group(backend="gloo")
    print(
        f"[{os.getpid()}] world_size = {dist.get_world_size()}, "
        + f"rank = {dist.get_rank()}, backend={dist.get_backend()}"
    )
    demo_basic(local_world_size, local_rank)
    # Tear down the process group
    dist.destroy_process_group()
And in main I have:
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    parser.add_argument("--local_world_size", type=int, default=1)
    args = parser.parse_args()
    spmd_main(args.local_world_size, args.local_rank)
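One detail that may matter when switching launchers: torch.distributed.launch passes the local rank as a command-line argument, whereas torchrun exports it as the LOCAL_RANK environment variable (and the local world size as LOCAL_WORLD_SIZE). Below is a minimal sketch of a parser that accepts either source. The helper name resolve_local_settings is hypothetical and not part of PyTorch, and letting the environment take precedence over the flags is an assumption of this sketch.

```python
import argparse
import os


def resolve_local_settings(argv=None, environ=None):
    """Return (local_world_size, local_rank) from the environment or the CLI.

    Hypothetical helper, not part of PyTorch. Environment variables set by
    torchrun take precedence; command-line flags are the fallback.
    """
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    # Accept both spellings of the flag used by torch.distributed.launch.
    parser.add_argument("--local_rank", "--local-rank", type=int, default=0)
    parser.add_argument("--local_world_size", type=int, default=1)
    # parse_known_args ignores unrelated flags instead of exiting.
    args, _unknown = parser.parse_known_args(argv)

    local_rank = int(environ.get("LOCAL_RANK", args.local_rank))
    local_world_size = int(environ.get("LOCAL_WORLD_SIZE", args.local_world_size))
    return local_world_size, local_rank
```

With this in place, the main block could call spmd_main(*resolve_local_settings()) and behave the same under both launchers.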
However, I don't know how to launch this code so that it spawns processes on both my host PC and the Mac. I am not very familiar with networking and have only very basic knowledge.
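For reference, here is a minimal sketch of what a two-machine launch with torchrun might look like, assuming one process per machine and that both machines can reach each other over the LAN. The address 192.168.1.10, the port 29500, and the script name ddp_cpu.py are placeholders. build_torchrun_command is a hypothetical helper that only assembles the command line and does not start anything. Whether the gloo backend interoperates between a Windows build and a macOS build of PyTorch is not something this sketch establishes.

```python
import shlex


def build_torchrun_command(node_rank, master_addr, master_port=29500,
                           nnodes=2, nproc_per_node=1, script="ddp_cpu.py"):
    """Assemble the torchrun argument list for one machine (hypothetical helper)."""
    if not 0 <= node_rank < nnodes:
        raise ValueError(f"node_rank must be in [0, {nnodes}), got {node_rank}")
    return [
        "torchrun",
        f"--nnodes={nnodes}",                  # total number of machines
        f"--nproc_per_node={nproc_per_node}",  # processes started on this machine
        f"--node_rank={node_rank}",            # 0 on the master, 1 on the other machine
        f"--master_addr={master_addr}",        # LAN address of the node_rank 0 machine
        f"--master_port={master_port}",        # a free TCP port on that machine
        script,
    ]


if __name__ == "__main__":
    # Placeholder address: replace with the real LAN IP of the rank-0 machine.
    for rank in range(2):
        print(f"machine {rank}:", shlex.join(build_torchrun_command(rank, "192.168.1.10")))
```

The same command, differing only in --node_rank, would be run once on each machine. The firewall on the rank-0 machine has to allow incoming connections on the chosen port.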