The Hogwild! example gives 99% accuracy, but when I change it into a multi-GPU version, it only gives 11% accuracy.
The difference is as follows:
```python
# main.py
model = Net()  # not moved to a specific device any more
model.share_memory()
processes = []
for rank in range(args.num_processes):
    local_device = torch.device(rank % 2)
    p = mp.Process(target=train, args=(rank, args, model, local_device, dataset1, kwargs))
    p.start()
    processes.append(p)
```
I then move the model to its device inside the subprocess:
```python
# train.py
def train(rank, args, model, device, dataset, dataloader_kwargs):
    model = model.to(device)  # move to the specific device in the sub-process
    torch.manual_seed(args.seed + rank)
    train_loader = torch.utils.data.DataLoader(dataset, **dataloader_kwargs)
    optimizer = optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)
    for epoch in range(1, args.epochs + 1):
        train_epoch(epoch, args, model, device, train_loader, optimizer)
```
It seems the model is no longer shared between the processes.
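My guess (and the snippet below is just a quick check I would run, not part of the example) is that `.to(cuda)` allocates new per-process storage for the parameters, so the CPU tensors that `share_memory()` placed in shared memory are no longer the ones the optimizer updates:

```python
# Quick check (my own sketch, not from the Hogwild! example; assumes at least
# one CUDA device is available): after .to(cuda) the parameters live in newly
# allocated CUDA storage, so the shared CPU tensors are no longer being trained.
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
model.share_memory()

p = next(model.parameters())
print(p.is_shared(), p.device, p.data_ptr())  # True cpu <shared-memory pointer>

model.to(torch.device("cuda", 0))             # allocates fresh storage on the GPU

q = next(model.parameters())
print(q.device, q.data_ptr())                 # cuda:0, different pointer -> private copy
```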
- Where are the mistakes?
- What are the correct steps?