Hi everyone!
First post here, so apologies if anything is unclear. I am facing a problem similar to the one mentioned here (Share gradients in multiprocessing). Since no answer was provided there, I will try to specify the problem further.
I have implemented a Hogwild model-training procedure. I am trying to modify it so that gradients computed with local models on N cores are accumulated/averaged onto the shared model. After aggregating the gradients from the N-1 other cores, I want the remaining core (rank 0) to update the shared model with the accumulated/averaged gradients.
An overview of what I am considering is as follows:
import torch.multiprocessing as mp
from model import MyModel
# SharedAdam: https://github.com/ikostrikov/pytorch-a3c/blob/master/my_optim.py
from my_optim import SharedAdam

def train(shared_model, shared_optimizer, rank):
    # Construct data_loader, loss_fn, etc. (omitted here)
    local_model = MyModel()
    for data, labels in data_loader:
        local_model.load_state_dict(shared_model.state_dict())  # Ensure local model is up to date
        local_model.zero_grad()  # Clear stale local grads; load_state_dict does not touch .grad
        loss_fn(local_model(data), labels).backward()
        sync_grads(local_model, shared_model)
        if rank == 0:
            shared_optimizer.step()  # This will update the shared parameters
            shared_optimizer.zero_grad()

def sync_grads(local_model, shared_model):
    for local_param, shared_param in zip(local_model.parameters(), shared_model.parameters()):
        if shared_param.grad is None:
            shared_param.grad = local_param.grad.clone()  # First contribution; .grad starts out as None
        else:
            shared_param.grad += local_param.grad  # Accumulating in this case

if __name__ == '__main__':
    num_processes = 4
    shared_model = MyModel()
    shared_model.share_memory()
    shared_optimizer = SharedAdam(shared_model.parameters())  # Note: shared model's parameters
    shared_optimizer.share_memory()
    processes = []
    for rank in range(num_processes):
        p = mp.Process(target=train, args=(shared_model, shared_optimizer, rank))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
My questions are:
- Will sync_grads() upload gradients to the shared model (so that grads are actually shared among processes)?
- Or will it only aggregate gradients onto a process-local view of shared_model, so the update would really only use the gradients computed in the rank-0 process?
- In other words, does putting shared_model in shared memory share parameters and gradients, or only parameters?
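To make the question concrete, here is a small probe I am thinking of running (untested sketch; it just checks the is_shared() flag on parameters and their grads after share_memory()):

from model import MyModel

model = MyModel()
model.share_memory()
for p in model.parameters():
    # Parameters themselves should report True after share_memory()
    print('param shared:', p.is_shared())
    # .grad is None until a backward pass has run, and a grad tensor
    # created later may or may not live in shared memory
    print('grad shared:', p.grad.is_shared() if p.grad is not None else 'no grad yet')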
Finally, if my fears are correct, how can I do this? I have seen this:
""" Gradient averaging. """
def average_gradients(model):
size = float(dist.get_world_size())
for param in model.parameters():
dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
param.grad.data /= size
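As I understand it, going this route would also need the torch.distributed setup from the same tutorial, roughly like this (sketch; gloo backend and the address/port values are placeholders for a single machine):

import os
import torch.distributed as dist

def init_process(rank, world_size, backend='gloo'):
    # Rendezvous via environment variables, as in the tutorial
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=world_size)

Each spawned worker would call init_process(rank, num_processes) once before training, and then average_gradients(model) after every backward().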
However, I am skeptical about whether it still works (the last update to the tutorial is from 2017). Alternatively, I am considering pushing each process's grads onto an mp.Queue, then having rank 0 pop them, average them, assign them to its local view of shared_model, and take the gradient step on that process only.
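For reference, the queue-based variant I have in mind would look roughly like this (untested sketch; grad_queue would be an mp.Queue created in the main process and passed to every worker, and queue_sync would replace sync_grads in the loop above):

import torch.multiprocessing as mp

def queue_sync(local_model, shared_model, shared_optimizer,
               grad_queue, rank, num_processes):
    if rank != 0:
        # Workers ship a copy of their local gradients to rank 0
        grad_queue.put([p.grad.clone() for p in local_model.parameters()])
    else:
        # Rank 0 collects everyone's gradients (plus its own), averages,
        # assigns the result to the shared model, and steps
        all_grads = [[p.grad.clone() for p in local_model.parameters()]]
        for _ in range(num_processes - 1):
            all_grads.append(grad_queue.get())
        for i, param in enumerate(shared_model.parameters()):
            param.grad = sum(g[i] for g in all_grads) / num_processes
        shared_optimizer.step()
        shared_optimizer.zero_grad()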
Thanks a lot!