Our team is planning to use CPUs from multiple machines for network inference and data production, and then use a single GPU server for network training. Is there anything in the torch.distributed package that can help us with this setup?
torch.distributed.rpc should be able to help. Here is a list of tutorials.
The use case looks similar to the following two examples:
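As a rough sketch of how the RPC API could fit this setup: a trainer process on the GPU server can call functions on CPU worker processes via torch.distributed.rpc and collect the data they produce. The process names ("trainer") and the generate_batch function below are illustrative, not from the tutorials; for brevity this demo runs with world_size=1 in a single process, whereas a real deployment would launch one process per machine, each with its own rank.

```python
import os
import torch
import torch.distributed.rpc as rpc

def generate_batch(batch_size):
    # Placeholder for the real work: on a CPU worker this would run the
    # network forward pass and return (input, prediction) training data.
    return torch.randn(batch_size, 4)

def main():
    # In production, every machine calls init_rpc with a shared MASTER_ADDR,
    # its own rank, and the total world_size.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    rpc.init_rpc("trainer", rank=0, world_size=1)

    # The trainer requests a batch from a worker over RPC; here the target
    # is itself, but it would normally be a name like "worker1".
    batch = rpc.rpc_sync("trainer", generate_batch, args=(8,))
    print(batch.shape)

    rpc.shutdown()

if __name__ == "__main__":
    main()
```

The training loop on the GPU server would then move each received batch to the GPU and run the optimizer step, while the CPU workers keep producing data in parallel.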
Thank you very much, I’ll look into them carefully!