TypeError in dist_autograd.backward(cid, [loss])

Yes, it is working locally after updating the PyTorch version. Many thanks @mrshenli and @rvarm1 for your contributions and for helping us implement a distributed system in our own way.
When I start training from a remote worker, it says `RuntimeError: [.../third_party/gloo/gloo/transport/tcp/pair.cc:769] connect [127.0.1.1]:14769: Connection refused`. The same issue was raised by @Oleg_Ivanov here; I am not sure whether it has been solved. Should I use `os.environ['GLOO_SOCKET_IFNAME'] = 'nonexist'`?
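For context, the workaround I am considering instead is pinning the sockets to the real network interface rather than a nonexistent one. A minimal sketch of what I mean (the interface name `eth0` is an assumption on my part; substitute whatever `ip addr` shows on your machines):

```python
import os

# Assumption: the machines reach each other over an interface named "eth0".
# Ubuntu maps the hostname to 127.0.1.1 in /etc/hosts, so Gloo can end up
# advertising that unreachable address unless the interface is pinned.
# Both variables must be set before init_rpc / init_process_group is called.
os.environ["GLOO_SOCKET_IFNAME"] = "eth0"  # for the Gloo process group
os.environ["TP_SOCKET_IFNAME"] = "eth0"    # for the TensorPipe RPC agent
```

Is pinning the interface like this the recommended fix, or is the `nonexist` trick from the other thread preferred?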

Can you suggest a tutorial for building such a small cluster (2-3 remote workers with 1 master PS) that implements a Parameter Server using PyTorch RPC?
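To be concrete, here is a rough sketch of the setup I have in mind, with rank 0 as the PS and the remaining ranks as trainers (the worker names and the `run` helper are my own assumptions, not from any tutorial):

```python
import torch.distributed.rpc as rpc

def run(rank, world_size):
    # Assumed layout: rank 0 is the parameter server ("ps"), ranks
    # 1..world_size-1 are remote trainers. MASTER_ADDR / MASTER_PORT must
    # point at the PS machine and be set identically on every node.
    name = "ps" if rank == 0 else f"trainer{rank}"
    rpc.init_rpc(name, rank=rank, world_size=world_size)
    # ... build the model on the PS, hand out RRefs to the trainers, and
    # drive the loop with dist_autograd + DistributedOptimizer ...
    rpc.shutdown()  # blocks until all workers have finished
    return name
```

Is this the right overall shape, or does the recommended pattern differ for multiple trainers?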

Thank you very much once again.