Is there any approach in pytorch to achieve distributed network connection between CPU prediction and GPU training?

Our team is planning to use CPUs on multiple machines to do network prediction and data production, and then use a single GPU server to do network training. Is there anything in the torch.distributed package that can help us with this setup?

torch.distributed.rpc should be able to help. Here is a list of tutorials.

The use case looks similar to the following two examples (a rough sketch of the RPC pattern follows the links):

  1. https://pytorch.org/tutorials/intermediate/rpc_tutorial.html#distributed-reinforcement-learning-using-rpc-and-rref
  2. https://pytorch.org/tutorials/intermediate/rpc_async_execution.html#batch-processing-cartpole-solver
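Below is a minimal sketch of how the pieces could fit together with torch.distributed.rpc, under these assumptions: one GPU "trainer" process (rank 0) plus CPU "producer" processes on the other machines, and names like `Trainer`, `run_producer`, the `gpu-server` hostname, and the world size are all illustrative, not something prescribed by the tutorials.

```python
import os
import torch
import torch.nn as nn
import torch.distributed.rpc as rpc

WORLD_SIZE = 3          # 1 GPU trainer + 2 CPU producers (assumption)
TRAINER_NAME = "trainer"


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 2)

    def forward(self, x):
        return self.fc(x)


class Trainer:
    """Lives on the GPU server; receives batches pushed by CPU producers."""

    def __init__(self):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = Net().to(self.device)
        self.opt = torch.optim.SGD(self.model.parameters(), lr=0.01)

    def get_weights(self):
        # Producers pull CPU copies of the weights for local prediction.
        return {k: v.cpu() for k, v in self.model.state_dict().items()}

    def push_batch(self, inputs, targets):
        # Called over RPC by producers; runs one training step on the GPU.
        inputs, targets = inputs.to(self.device), targets.to(self.device)
        loss = nn.functional.mse_loss(self.model(inputs), targets)
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        return loss.item()


TRAINER = None  # single instance, only created on the trainer process


def trainer_get_weights():
    return TRAINER.get_weights()


def trainer_push_batch(inputs, targets):
    return TRAINER.push_batch(inputs, targets)


def run_producer(steps=5):
    model = Net()  # CPU copy used only for prediction / data production
    for _ in range(steps):
        # Sync the latest weights, generate a batch on the CPU, push it back.
        model.load_state_dict(rpc.rpc_sync(TRAINER_NAME, trainer_get_weights))
        with torch.no_grad():
            x = torch.randn(32, 8)
            y = model(x) + 0.1 * torch.randn(32, 2)  # stand-in for real data
        loss = rpc.rpc_sync(TRAINER_NAME, trainer_push_batch, args=(x, y))
        print(f"trainer loss: {loss:.4f}")


def main():
    global TRAINER
    rank = int(os.environ["RANK"])  # set per process by your launcher (assumption)
    os.environ.setdefault("MASTER_ADDR", "gpu-server")  # hostname is an assumption
    os.environ.setdefault("MASTER_PORT", "29500")
    if rank == 0:
        TRAINER = Trainer()  # create before init_rpc so producers never see None
        rpc.init_rpc(TRAINER_NAME, rank=0, world_size=WORLD_SIZE)
    else:
        rpc.init_rpc(f"producer{rank}", rank=rank, world_size=WORLD_SIZE)
        run_producer()
    rpc.shutdown()  # blocks until all outstanding RPC work is done


if __name__ == "__main__":
    main()
```

You would launch one such process per machine with the appropriate `RANK`; the linked tutorials show fuller variants of this pattern, including batching requests with `@rpc.functions.async_execution` and holding remote state behind RRefs instead of a module-level global.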

Thank you very much, I’ll look into them carefully!