Distributed torch

I understand that torch distributed is still experimental, but I am trying to run it on a Hadoop cluster, following the examples on this page: https://github.com/pytorch/pytorch/issues/241. Nothing seems to work: when I run `torch.distributed.init_process_group(backend='tcp')`, I get the error `AttributeError: module 'torch._C' has no attribute '_dist_init_process_group'`. Any suggestions? Thanks for the help.
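For reference, a minimal sketch of the failing call (nothing Hadoop-specific is needed to trigger it; the call is exactly the one quoted above):

```python
import torch
import torch.distributed

# Fails on my build with:
# AttributeError: module 'torch._C' has no attribute '_dist_init_process_group'
torch.distributed.init_process_group(backend='tcp')
```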


Distributed C code isn't built by default. You'd need to set the WITH_DISTRIBUTED=1 environment variable when building.
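A quick way to check whether your current binary has the distributed bindings is to look for the private attribute named in the error (just a sketch; rebuilding from a source checkout would then be roughly `WITH_DISTRIBUTED=1 python setup.py install`):

```python
import torch

# If this prints False, the binary was built without WITH_DISTRIBUTED=1
# and init_process_group will raise the AttributeError shown above.
print(hasattr(torch._C, '_dist_init_process_group'))
```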

Also, the TCP backend is likely to be quite slow, especially since the one in the main repo is a bit outdated. We've recently added support for Gloo, which should be fast, and MPI is a reasonable default.
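Once you have a distributed-enabled build, switching backends is just the `backend` argument. A rough sketch, assuming a TCP rendezvous on a rank-0 host; the address, port, rank, and world size are placeholders you'd fill in from however your Hadoop launcher assigns them:

```python
import torch.distributed as dist

# Sketch: Gloo backend with an explicit TCP rendezvous point.
# '10.0.0.1:23456', rank=0 and world_size=4 are placeholder values.
dist.init_process_group(
    backend='gloo',
    init_method='tcp://10.0.0.1:23456',
    rank=0,
    world_size=4,
)
```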