torch.distributed.all_reduce

I ran into a problem when using torch.distributed.all_reduce. I want to manually reduce and sum all model parameter gradients.
This is the first solution, which gives me the correctly reduced and summed results.

    for p in params:
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
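
For context, here is a minimal self-contained sketch of how such a manual gradient all-reduce might sit around a backward pass. The process-group setup and the optional averaging step are assumptions for illustration, not part of the original post.

    import torch
    import torch.distributed as dist

    # Assumes the process group has already been initialized elsewhere,
    # e.g. dist.init_process_group("gloo"), and that loss.backward() has
    # run so every parameter has a populated .grad.
    def all_reduce_grads(params):
        for p in params:
            if p.grad is not None:
                # Sum the gradient tensor across all ranks, in place.
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
                # Optionally average instead of keeping the raw sum:
                # p.grad /= dist.get_world_size()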

However, the second solution below does not perform any reduction at all; after it runs, the gradients still have the same values as before the reduction.

    def sync_para(x):
        # all_reduce modifies x.grad in place, so no reassignment is needed
        dist.all_reduce(x.grad, op=dist.ReduceOp.SUM)

    map(lambda x: sync_para(x), params)

Why can't the map function be used here?

@DanielWang Thanks for posting the question. In the second case, did you see any errors when running it? My guess is that since map() is written in C and applies some optimizations, its implied loop may behave differently from a regular Python for loop, so the ordering might differ, and since all_reduce happens in SPMD fashion, a mismatched order across ranks might cause issues.

Another solution is to use async_op=True instead of the synchronous call.
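
For what it's worth, a rough sketch of the async_op=True variant; the structure below (collect the work handles, then wait on them before the optimizer step) is an assumption about usage, not code from the original post.

    # Launch the all-reduce for every gradient without blocking.
    handles = []
    for p in params:
        if p.grad is not None:
            # async_op=True returns a work handle instead of blocking.
            handles.append(
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM, async_op=True)
            )

    # Block until every all-reduce has finished before using the gradients.
    for h in handles:
        h.wait()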

I see, I'm curious what makes async_op=True work. Another thought: is the map really running? map() returns a lazy iterator instead of actually looping over the list. Did you try putting print statements inside to see if it really gets invoked, or wrapping it with something like list(map(lambda x: sync_para(x), params))?
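
To illustrate the laziness point with a small sketch (independent of torch): in Python 3, map() builds a lazy map object and runs nothing until it is consumed, so wrapping it in list() or using a plain for loop is what actually triggers the calls.

    def sync_para(x):
        print("reducing", x)

    items = [1, 2, 3]

    m = map(sync_para, items)   # nothing is printed yet: map() is lazy
    list(m)                     # forces evaluation; sync_para runs three times

    # Equivalent, and clearer when the function is called for its side effects:
    for x in items:
        sync_para(x)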