Greetings,
I am trying to implement a parameter server following this tutorial.
I have two observations:
1- Execution is too slow. Training one epoch takes 2-3 minutes, whereas DDP with NCCL takes 30 seconds.
2- I noticed that the communication backend is Gloo. How can I switch it to NCCL?
Also, are there any other recommendations to speed up the training?
Best
If you have feature requests, feel free to create an issue on GitHub, or perhaps look into implementing a parameter server architecture with just collective communication.
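To make the collective-communication suggestion concrete, here is a minimal sketch of what one parameter-server-style step could look like using only `torch.distributed.reduce` and `torch.distributed.broadcast`. The names `ps_step` and `demo_single_process` are hypothetical helpers written for this post, not part of PyTorch or the tutorial, and the sketch assumes every rank holds a same-shaped copy of the parameters with gradients already computed.

```python
import os
import tempfile

import torch
import torch.distributed as dist


def ps_step(params, lr, server_rank=0):
    """One SGD step in parameter-server style, built only from collectives.

    Hypothetical helper: assumes each rank has already called backward(),
    so every tensor in `params` has a populated `.grad`.
    """
    world_size = dist.get_world_size()
    for p in params:
        # "Push": sum every worker's gradient onto the server rank.
        # The contents of p.grad on non-server ranks should not be relied on
        # afterwards, so zero the gradients before the next backward pass.
        dist.reduce(p.grad, dst=server_rank, op=dist.ReduceOp.SUM)
        if dist.get_rank() == server_rank:
            # Only the server applies the update, using the averaged gradient.
            with torch.no_grad():
                p -= lr * p.grad / world_size
        # "Pull": every rank receives the server's updated parameters.
        dist.broadcast(p.detach(), src=server_rank)


def demo_single_process(backend="gloo"):
    """Run one step in a 1-process group so the sketch can be tried on CPU."""
    with tempfile.TemporaryDirectory() as tmp:
        dist.init_process_group(
            backend=backend,
            init_method="file://" + os.path.join(tmp, "rendezvous"),
            rank=0,
            world_size=1,
        )
        try:
            w = torch.ones(3, requires_grad=True)
            loss = (2.0 * w).sum()  # gradient is 2 for every element
            loss.backward()
            ps_step([w], lr=0.1)
            return w.detach().clone()
        finally:
            dist.destroy_process_group()
```

In a real multi-GPU job you would launch one process per GPU, initialize the process group with `backend="nccl"`, and keep the parameters on each process's CUDA device. The same two collectives are available with NCCL, so `ps_step` itself should not need to change. Whether this comes close to DDP's speed depends on the model and interconnect, so treat it as a starting point to benchmark rather than a guaranteed fix.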