Accumulating local gradients into a global shared model

Hi everyone!

First post here, so apologies if anything is unclear. I am currently facing a problem similar to the one mentioned here (Share gradients in multiprocessing). Since no answer was provided there, I will try to specify the problem further.

I have implemented a Hogwild model training procedure. I am trying to modify it so that gradients computed with local models on N cores can be accumulated/averaged onto the shared model. Once the gradients from all cores have been aggregated, I want a single core (rank 0 below) to update the shared model using the accumulated/averaged gradients.

An overview of what I am considering is as follows:

import torch.multiprocessing as mp

from model import MyModel
from my_optim import SharedAdam  # https://github.com/ikostrikov/pytorch-a3c/blob/master/my_optim.py

def train(shared_model, shared_optimizer, rank):
    # Construct data_loader, loss_fn, etc. (omitted here)
    local_model = MyModel()
    for data, labels in data_loader:
        local_model.load_state_dict(shared_model.state_dict())  # Ensure the local model is up to date
        local_model.zero_grad()  # Clear stale local gradients from the previous iteration
        loss_fn(local_model(data), labels).backward()
        sync_grads(local_model, shared_model)
        if rank == 0:
            shared_optimizer.step()  # This will update the shared parameters
            shared_optimizer.zero_grad()

def sync_grads(local_model, shared_model):
    for local_param, shared_param in zip(local_model.parameters(), shared_model.parameters()):
        if shared_param.grad is None:
            shared_param._grad = local_param.grad.clone()
        else:
            shared_param._grad = shared_param.grad + local_param.grad  # Accumulating in this case


if __name__ == '__main__':
    num_processes = 4
    shared_model = MyModel()
    shared_model.share_memory()
    
    shared_optimizer = SharedAdam(shared_model.parameters())  # Note: optimizer over the shared model's parameters
    shared_optimizer.share_memory()
    processes = []
    for rank in range(num_processes):
        p = mp.Process(target=train, args=(shared_model, shared_optimizer, rank, ))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()

My questions are:

  • Will sync_grads() write the gradients into the shared model (so that the grads are shared among processes)?
  • Or will it only aggregate gradients onto a process-local copy of shared_model, so that the update effectively uses only the gradients computed in the rank 0 process?
  • In other words, does putting shared_model in shared memory share both parameters and gradients, or only the parameters? (A minimal check is sketched below.)
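
For reference, a minimal check along these lines (just a sketch, reusing MyModel from above) should show which tensors share_memory() actually places in shared memory:

from model import MyModel

model = MyModel()
model.share_memory()  # moves parameters (and buffers) into shared memory

for name, p in model.named_parameters():
    # .is_shared() reports whether the tensor's storage lives in shared memory;
    # .grad is None until a backward pass (or manual assignment) creates it
    print(name,
          'param shared:', p.is_shared(),
          'grad:', None if p.grad is None else p.grad.is_shared())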

Finally, if my fears are correct, how can I do this? I have seen this:

""" Gradient averaging. """
def average_gradients(model):
    size = float(dist.get_world_size())
    for param in model.parameters():
        dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
        param.grad.data /= size

However, I am skeptical about whether it will still work (the last update to that tutorial is from 2017). Alternatively, I am considering having every process push its grads onto an mp.Queue, then having rank 0 pop them, average them, assign them to its local view of shared_model, and take the gradient step on that process only (rough sketch below).
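
Something along these lines is what I have in mind for the queue variant (just a sketch, untested; the helper names are placeholders):

import torch

def push_local_grads(local_model, grad_queue):
    # Every worker ships a detached copy of its gradients to rank 0
    grad_queue.put([p.grad.detach().clone() for p in local_model.parameters()])

def aggregate_from_queue(shared_model, grad_queue, num_processes):
    # Rank 0 pops one gradient list per process, averages them parameter-wise,
    # and writes the result into the shared model's .grad fields before step()
    grads_per_proc = [grad_queue.get() for _ in range(num_processes)]
    for param, *grads in zip(shared_model.parameters(), *grads_per_proc):
        param._grad = torch.stack(grads).mean(dim=0)

Here grad_queue would be an mp.Queue() created in __main__ and passed to every process, and the non-zero ranks would still need some form of barrier before reloading the shared weights.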

Thanks a lot!

@jleguina0 Does DDP fit your need? Getting Started with Distributed Data Parallel — PyTorch Tutorials 1.9.0+cu102 documentation. You will have one model replica per process. Instead of storing the gradients in a shared model, DDP will update all the model replicas. You can define a communication hook (DDP Communication Hooks — PyTorch 1.9.0 documentation) for custom gradient communication logic.
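
For example, a minimal CPU/gloo setup might look like this (rough sketch, reusing MyModel and the data_loader / loss_fn from your snippet):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from model import MyModel

def train(rank, world_size):
    # Single-node process-group setup; address/port are arbitrary local values
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('gloo', rank=rank, world_size=world_size)

    model = DDP(MyModel())  # one replica per process
    optimizer = torch.optim.Adam(model.parameters())

    for data, labels in data_loader:  # data_loader / loss_fn constructed as in your snippet
        optimizer.zero_grad()
        loss_fn(model(data), labels).backward()  # gradients are all-reduced (averaged) across replicas here
        optimizer.step()  # every replica then applies the same averaged update

    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = 4
    mp.spawn(train, args=(world_size,), nprocs=world_size)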

@gcramer23 That is exactly what I was looking for! Thanks.