def run(rank, size):
    ...
    model = Net()
    ...

if __name__ == '__main__':
    ...
    for rank in range(size):
        p = Process(target=init_process, args=(rank, size, run))
    ...
if __name__ == '__main__':
    ...
    model = Net().to(device)
    ...
    for rank in range(args.num_process):
        p = mp.Process(target=train, args=(rank, args, model, device, dataloader_kwargs))
    ...
So I just wonder: where should the model be declared, inside or outside these processes? When and how do the gradients get synchronized?
If you don’t care about doing Hogwild training, the second example you list is not applicable to you. There, everything is declared in the primary process because the model weights are meant to be physically shared (through shared memory) between the worker processes.
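In PyTorch this physical sharing is what `model.share_memory()` sets up before the workers are spawned. The idea itself can be sketched with nothing but the standard library; all names below (`worker`, `train_shared`) are illustrative, not a torch API:

```python
import multiprocessing as mp

def worker(shared_weights, delta):
    # Updates land directly in the shared buffer; no per-process copy exists.
    with shared_weights.get_lock():
        for i in range(len(shared_weights)):
            shared_weights[i] += delta

def train_shared(num_workers=4, delta=1.0):
    ctx = mp.get_context('fork')  # fork keeps this sketch self-contained
    # 'd' = double-precision floats; this buffer lives in shared memory,
    # so it is declared once, in the primary process, like the model above.
    weights = ctx.Array('d', [0.0, 0.0, 0.0])
    procs = [ctx.Process(target=worker, args=(weights, delta))
             for _ in range(num_workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return list(weights)

if __name__ == '__main__':
    # Every worker's in-place update is visible to the parent.
    print(train_shared())
```

The point of the sketch is only that workers mutate one buffer in place rather than each owning a replica, which is why the Hogwild example must build the model before forking.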
If you simply want to use a single process per GPU and don’t need physical weight sharing, then you should declare everything in the subprocesses. Think of it as if you were running on different machines instead of multiple processes on a single machine: there, too, you would have to declare everything in the processes launched on those machines rather than in a single launcher process.
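A stdlib-only sketch of that pattern (all names illustrative; real code would use `torch.distributed` and `dist.all_reduce` on the gradients): each worker declares its own parameter copy, the local gradients are averaged, and every replica applies the same update, so the replicas stay identical without sharing memory.

```python
import multiprocessing as mp

def worker(rank, grad_q, conn):
    # Each process declares its *own* replica, as it would on its own machine.
    weight = 1.0                   # identical initialization on every rank
    local_grad = float(rank + 1)   # pretend each rank saw a different batch
    grad_q.put(local_grad)         # contribute the local gradient
    avg_grad = conn.recv()         # receive the averaged ("all-reduced") gradient
    weight -= 0.1 * avg_grad       # identical update, so replicas stay in sync
    conn.send(weight)

def run_workers(size=3):
    ctx = mp.get_context('fork')
    grad_q = ctx.Queue()
    pipes = [ctx.Pipe() for _ in range(size)]
    procs = [ctx.Process(target=worker, args=(r, grad_q, pipes[r][1]))
             for r in range(size)]
    for p in procs:
        p.start()
    # The launcher plays the role of the reduction here; in torch.distributed
    # the ranks themselves would call dist.all_reduce instead.
    avg = sum(grad_q.get() for _ in range(size)) / size
    for parent_conn, _ in pipes:
        parent_conn.send(avg)
    weights = [parent_conn.recv() for parent_conn, _ in pipes]
    for p in procs:
        p.join()
    return weights

if __name__ == '__main__':
    print(run_workers())  # all ranks end with the same weight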
Big thanks for your reply! So Hogwild is not a must for distributed data parallelism: even if the models are declared in different processes, they can still synchronize their parameters and gradients in some way. Is that right?
I tried saving the models at the end of these processes using the following code, but I found that the parameters of the saved models differ across processes. Does this mean I did something wrong?
def run(rank, size):
    ...
    model = Net()
    ...
    for epoch in range(args.epochs):
        for batch_idx, (input, target) in enumerate(train_loader):
            output = model(input)
            loss = criterion(output, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    ...
    # torch.save, not model.save: nn.Module has no save() method
    torch.save({...}, f'checkpoint-process-{rank}.pth')
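For intuition on why the checkpoints can differ: in the loop above, nothing averages the gradients across ranks before `optimizer.step()`, so each replica steps on its own local gradient and the replicas drift apart. A tiny single-process simulation of two ranks (illustrative numbers, not torch code) shows the effect; the `synchronize` branch does what summing gradients with `dist.all_reduce` and dividing by the world size does:

```python
def train(grads_per_rank, synchronize):
    """Simulate several ranks updating identically initialized weights."""
    size = len(grads_per_rank)
    weights = [1.0] * size            # identical initialization on every rank
    for step in range(len(grads_per_rank[0])):
        local = [g[step] for g in grads_per_rank]
        if synchronize:
            # Average the local gradients, as an all-reduce would.
            local = [sum(local) / size] * size
        weights = [w - 0.1 * g for w, g in zip(weights, local)]
    return weights

grads = [[1.0, 3.0], [2.0, 4.0]]          # each rank sees different batches
print(train(grads, synchronize=False))    # replicas drift apart
print(train(grads, synchronize=True))     # replicas stay identical
```

So differing checkpoints are the expected outcome unless the gradients (or the parameters) are explicitly synchronized each step.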