Research: distributed learning platforms

My research is about distributed deep learning, and I am looking for a research development platform on which new ideas can be implemented. For complex architectures, we need more control over the communication between nodes and GPUs.
I found TF-Replicator from TensorFlow, I looked at the PyTorch RPC and DDP APIs, and also Ray, a newly designed and interesting platform for emerging AI workloads.

I couldn't find any good examples or tutorials for TF-Replicator or Ray. However, PyTorch has good documentation, and I feel it gives more control over how to implement more complicated architectures and communication schemes.
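To illustrate the level of control PyTorch exposes, here is a minimal DDP sketch (my own, not from the docs, and simplified to a single process with the `gloo` backend; in practice you would launch one process per GPU/node and pass the real rank and world size):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process setup for illustration; a real job sets these per launcher.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(4, 2)
ddp_model = DDP(model)  # gradients are all-reduced across ranks during backward()

out = ddp_model(torch.randn(8, 4))
out.sum().backward()  # triggers the gradient synchronization hooks

dist.destroy_process_group()
```

The point is that the process group and backend are explicit, so communication behaviour is at least visible, even if DDP's averaging itself is fixed.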

Therefore, it would be great if others could share their thoughts as well.
Thank you

Hey @sakh251

I just want to mention one relevant WIP project that we are working on. We are running internal reviews on its code-level design, and should be able to share it on GitHub in the next few weeks. This should make customizing DDP a lot easier.

@sakh251 Regarding TF-Replicator (which is now part of tf.distribute.Strategy): are you referring to dedicated support for parameter-server-based training?

I am looking for a high-level API that can control the behaviour of learning. For example, sending extra data along with the model or gradients, or feeding the output of one network into another (as in GANs or autoencoders) when the networks live on different machines. In that case we need something like a parameter server, but with more control. So the question is: which of these platforms provides the most flexible high-level API for researchers, rather than only the pre-built strategies?

Hi @mrshenli, that sounds interesting; I hope it will be available soon.