Gloo doesn't support reduce operation

Gloo doesn’t support the reduce operation on GPU tensors. How can I use the TCP backend to transfer parameters with dist.reduce and dist.broadcast for GPU tensors?

    for param in model.parameters():
        param.data = param.data.cpu()    # Gloo reduce only works on CPU tensors, so move off the GPU
        dist.reduce(param.data, dst=0, op=dist.reduce_op.SUM, group=group)  # sum onto rank 0
        if dist.get_rank() == 0:
            param.data = param.data / size    # rank 0 averages over the number of processes
        dist.broadcast(param.data, src=0, group=group)  # send the averaged parameters back out
        param.data = param.data.cuda()   # move back onto the GPU

I use code like this, but it costs a lot of CPU time and ends up slower than before.

I’d really appreciate any help!

Does anyone have a good idea?

If you’re trying to average gradients across processes, do this instead (it’s going to be faster than reduce + broadcast):

    grad = param.grad
    avg_grad = grad / num_processes   # pre-divide so the summed result is the mean
    dist.all_reduce(avg_grad)         # defaults to SUM and runs in place on every process
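
For example, here is a minimal sketch of applying that to every gradient after backward(); the function name is mine, and it assumes the process group has already been initialized:

    import torch.distributed as dist

    def average_gradients(model):
        # Minimal sketch, assuming dist.init_process_group(...) was already called.
        world_size = float(dist.get_world_size())
        for param in model.parameters():
            if param.grad is None:
                continue
            param.grad.data /= world_size     # pre-divide so the summed result is the mean
            dist.all_reduce(param.grad.data)  # SUM across all ranks, in place

Call it between loss.backward() and optimizer.step(), and every rank ends up stepping with the same averaged gradients.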

Also, you might want to use torch.nn.parallel.DistributedDataParallel instead. It will automatically optimize the transfers to give you better performance.
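
For illustration, a rough sketch of wrapping a model with it; the address, rank, world_size, local_rank and MyModel below are placeholders for your own setup:

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel

    # Rough sketch: rank, world_size, local_rank and MyModel are placeholders.
    dist.init_process_group(backend='gloo',
                            init_method='tcp://127.0.0.1:23456',
                            rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    model = MyModel().cuda()
    ddp_model = DistributedDataParallel(model, device_ids=[local_rank])
    # Train ddp_model like an ordinary module; it averages gradients across
    # processes during backward() and overlaps communication with computation.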

I want to implement the parameter server architecture, so I need to send all gradients/parameters to the parameter server, average them there, and then send them back to the workers. The reduce operation looks similar to what I want, but it is not quite the same. And all_reduce looks like an operation where the workers communicate among themselves; does that setting cost more communication? http://pytorch.org/tutorials/intermediate/dist_tuto.html
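
For comparison, here is a sketch of what a parameter-server-style step looks like with the same collectives, assuming rank 0 plays the server role (the function name and arguments are mine):

    import torch.distributed as dist

    def ps_average_gradients(model, group, world_size):
        # Sketch only: rank 0 acts as the "server", the other ranks as workers.
        for param in model.parameters():
            # every rank contributes its gradient; the sum lands on rank 0
            dist.reduce(param.grad.data, dst=0, op=dist.reduce_op.SUM, group=group)
            if dist.get_rank() == 0:
                param.grad.data /= world_size    # the server averages
            # the averaged gradient is pushed back to every worker
            dist.broadcast(param.grad.data, src=0, group=group)

Note that with Gloo this still hits the original limitation (reduce is not implemented for CUDA tensors, while broadcast and all_reduce are), and all_reduce spreads the traffic across all ranks instead of funnelling everything through rank 0, so in practice it is usually not more expensive.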

Does PyTorch support the parameter server architecture, i.e. a parameter server plus workers, as in Mu’s paper?