Distributed torch

I understand that torch distributed is still experimental, but I am trying to run it on a Hadoop cluster, following the examples on this page: https://github.com/pytorch/pytorch/issues/241. Nothing seems to work: when I run `torch.distributed.init_process_group(backend='tcp')`, I get the error `AttributeError: module 'torch._C' has no attribute '_dist_init_process_group'`. Any suggestions? Thanks for the help.
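For reference, a minimal sketch of the failing call (nothing Hadoop-specific is needed to trigger it; the call is exactly the one quoted above):

```python
import torch
import torch.distributed

# Fails on my build with:
# AttributeError: module 'torch._C' has no attribute '_dist_init_process_group'
torch.distributed.init_process_group(backend='tcp')
```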


Distributed C code isn't built by default. You'd need to set the WITH_DISTRIBUTED=1 environment variable when building.
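A quick way to check whether your current binary has the distributed bindings is to look for the private attribute named in the error (just a sketch; rebuilding from a source checkout would then be roughly `WITH_DISTRIBUTED=1 python setup.py install`):

```python
import torch

# If this prints False, the binary was built without WITH_DISTRIBUTED=1
# and init_process_group will raise the AttributeError shown above.
print(hasattr(torch._C, '_dist_init_process_group'))
```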

Also, the TCP backend is likely to be quite slow, especially since the one in the main repo is a bit outdated. We've recently added support for Gloo, which should be fast, and MPI is a reasonable default.
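Once you have a distributed-enabled build, switching backends is just the `backend` argument. A rough sketch, assuming a TCP rendezvous on a rank-0 host; the address, port, rank, and world size are placeholders you'd fill in from however your Hadoop launcher assigns them:

```python
import torch.distributed as dist

# Sketch: Gloo backend with an explicit TCP rendezvous point.
# '10.0.0.1:23456', rank=0 and world_size=4 are placeholder values.
dist.init_process_group(
    backend='gloo',
    init_method='tcp://10.0.0.1:23456',
    rank=0,
    world_size=4,
)
```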