If you’re trying to average gradients across processes, do this instead (it will be faster than a reduce followed by a broadcast):
import torch.distributed as dist

grad = param.grad
grad.div_(num_processes)   # scale the local gradient first
dist.all_reduce(grad)      # in-place sum across processes -> the average
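In a full training loop this runs right after backward() and before the optimizer step; a minimal sketch, assuming model, optimizer, loader, criterion, and num_processes are already set up:

import torch.distributed as dist

for data, target in loader:
    optimizer.zero_grad()
    loss = criterion(model(data), target)
    loss.backward()
    for param in model.parameters():
        if param.grad is not None:
            param.grad.div_(num_processes)   # scale locally
            dist.all_reduce(param.grad)      # sum in place -> every rank holds the average
    optimizer.step()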
Also, you might want to use torch.nn.parallel.DistributedDataParallel instead. It automatically optimizes the transfers, giving you better performance.
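For reference, a minimal DistributedDataParallel setup might look like the sketch below (the gloo backend, env:// init, and stand-in model are placeholder choices; launch one process per rank with the usual rank/world-size environment variables set):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

dist.init_process_group(backend='gloo', init_method='env://')  # rank/world size from env vars
model = torch.nn.Linear(10, 10)             # stand-in model for illustration
ddp_model = DistributedDataParallel(model)
# Train ddp_model as usual; gradients are averaged across processes during backward().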
I want to implement the parameter server architecture, so I need to send all gradients/parameters to a parameter server, average them there, and then send the result back to the workers. The reduce operation looks similar to what I want, but it isn't quite the same. And all_reduce looks like an operation where the workers communicate among themselves; does that setting incur more communication cost? http://pytorch.org/tutorials/intermediate/dist_tuto.html
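To make the flow concrete, here is a sketch of what I mean, with rank 0 standing in for the parameter server (assuming the process group is already initialized, model exists, and num_processes = dist.get_world_size()):

import torch.distributed as dist

server_rank = 0
for param in model.parameters():
    # Sum every worker's gradient onto the "server" rank.
    dist.reduce(param.grad, dst=server_rank)
    if dist.get_rank() == server_rank:
        param.grad.div_(num_processes)          # average on the server
    # Send the averaged gradient back to all workers.
    dist.broadcast(param.grad, src=server_rank)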
Does PyTorch support the parameter server architecture, i.e. a parameter server plus workers, as in Mu Li's paper?
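(A toy sketch of that pattern using PyTorch's point-to-point primitives, just to illustrate that the building blocks exist; rank 0 plays the server, and the tensor shape is arbitrary:)

import torch
import torch.distributed as dist

# Assumes dist.init_process_group(...) has already been called.
if dist.get_rank() == 0:                     # rank 0 acts as the parameter server
    total = torch.zeros(10)
    buf = torch.zeros(10)
    for worker in range(1, dist.get_world_size()):
        dist.recv(buf, src=worker)           # collect a worker's gradient
        total += buf
    total /= dist.get_world_size() - 1       # average over the workers
    for worker in range(1, dist.get_world_size()):
        dist.send(total, dst=worker)         # ship the average back
else:
    grad = torch.randn(10)                   # this worker's local gradient (toy data)
    dist.send(grad, dst=0)
    dist.recv(grad, src=0)                   # grad now holds the average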