Parameter Server with RPC and NCCL

Greetings,
I am trying to implement a parameter server following this tutorial.
I have two observations:
1- Execution is too slow: one epoch takes 2-3 minutes, whereas DDP with NCCL takes 30 seconds.
2- I noticed that the communication backend is GLOO. How can I switch it to NCCL?

Also, are there any other recommendations to speed up the training?
Best

RPC does not use the GLOO or NCCL backends; it uses TensorPipe (pytorch/tensorpipe on GitHub, a tensor-aware point-to-point communication primitive for machine learning) as its backend. RPC is currently in maintenance mode, but there is limited support for doing RPC with CUDA tensors, which should be up to an order of magnitude (10x) faster. See the tutorial "Direct Device-to-Device Communication with TensorPipe CUDA RPC" in the PyTorch Tutorials 2.4.0+cu124 documentation.
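
For reference, a minimal sketch of how CUDA RPC is enabled on the caller side: you declare a device map in the TensorPipe backend options so that CUDA tensors sent to a given worker land directly on a chosen device there. The worker name "ps", the device indices, and the helper name `make_trainer_options` are assumptions for illustration, not taken from your code.

```python
import torch
import torch.distributed.rpc as rpc


def make_trainer_options(ps_name="ps", local_device="cuda:0", remote_device="cuda:1"):
    """Build RPC options that map a local GPU to a GPU on the parameter server.

    ps_name, local_device and remote_device are hypothetical placeholders;
    use the worker name and devices from your own setup.
    """
    options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=16)
    # Tensors on local_device sent to worker ps_name arrive on remote_device,
    # without being staged through CPU memory by user code.
    options.set_device_map(ps_name, {local_device: remote_device})
    return options


# On a trainer process (not run here, since it needs the other workers to be up):
# rpc.init_rpc("trainer0", rank=1, world_size=2,
#              rpc_backend_options=make_trainer_options())
```

Without a device map, sending CUDA tensors over RPC raises an error, so the arguments have to be moved to CPU first, which is likely part of the slowdown you are seeing.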

If you have feature requests, feel free to create an issue on GitHub, or look into implementing the parameter server architecture with just collective communication.
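
A rough sketch of the collective-only idea, under some assumptions: rank 0 holds the authoritative parameters and also acts as a worker, every rank calls the step with its local gradient, and the update is plain SGD. `ps_step` and `SERVER_RANK` are made-up names for illustration. The process group can be initialized with the NCCL backend when the tensors live on GPUs.

```python
import torch
import torch.distributed as dist

SERVER_RANK = 0  # assumption: rank 0 owns the authoritative parameters


def ps_step(param: torch.Tensor, grad: torch.Tensor, lr: float) -> None:
    """One synchronous parameter-server step built from two collectives.

    Must be called by every rank after dist.init_process_group(...).
    """
    world_size = dist.get_world_size()
    with torch.no_grad():
        # "Push": sum the gradients from all ranks onto the server rank.
        dist.reduce(grad, dst=SERVER_RANK, op=dist.ReduceOp.SUM)
        if dist.get_rank() == SERVER_RANK:
            # The server averages the summed gradient and applies SGD.
            param.sub_(lr * grad / world_size)
        # "Pull": the server sends the updated parameters to every rank.
        dist.broadcast(param, src=SERVER_RANK)
```

In a training loop you would call `ps_step(p.data, p.grad, lr)` for each parameter after `loss.backward()`. Note that a fully synchronous version like this ends up close to what DDP already does with all-reduce, so it mainly makes sense if you want to customize the server-side update.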