Multi-machine inference with PyTorch

Hi, I’m new to distributed computation in PyTorch.
I’m interested in partitioning a network so that one piece runs on machine A and the other piece runs on machine B. The first thing I need to do is send tensors from machine A to machine B.
So I thought about using point-to-point communication as in Writing Distributed Applications with PyTorch. I’m trying to adapt that code to send messages between machines A and B, but I haven’t succeeded so far. Can anyone explain the whole pipeline for this?
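For concreteness, the kind of two-machine exchange I have in mind is roughly the sketch below, adapted from the tutorial’s send/recv pattern. The gloo backend and the MASTER_ADDR/MASTER_PORT setup are my assumptions; rank 0 would run on machine A and rank 1 on machine B.

```python
# Minimal point-to-point sketch (gloo backend over TCP) -- my assumptions,
# not working code yet. Run with RANK=0 on machine A and RANK=1 on machine B;
# MASTER_ADDR / MASTER_PORT must point to an address both machines can reach.
import os
import torch
import torch.distributed as dist

def main():
    rank = int(os.environ["RANK"])   # 0 on machine A, 1 on machine B
    dist.init_process_group(
        backend="gloo",              # CPU tensors over TCP
        init_method="env://",        # reads MASTER_ADDR / MASTER_PORT
        rank=rank,
        world_size=2,
    )
    if rank == 0:
        t = torch.randn(4)               # tensor produced on machine A
        dist.send(tensor=t, dst=1)       # blocking send to machine B
        print("A sent:", t)
    else:
        t = torch.zeros(4)               # buffer must match sender's shape/dtype
        dist.recv(tensor=t, src=0)       # blocking receive from machine A
        print("B received:", t)
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```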
Any help would be appreciated!

If you want to use model sharding, this simple example might be useful.
The linked tutorial explains a distributed setup, so let me know if I misunderstood your use case.
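In spirit, model sharding just means placing different parts of the model on different devices and moving the activations between them, roughly like the sketch below. The layer sizes and the assumption of two visible GPUs (cuda:0 and cuda:1) are mine, not the linked example’s.

```python
# Rough model-sharding sketch across two devices on a single machine
# (assumes two visible GPUs; illustrative sizes only).
import torch
import torch.nn as nn

class ShardedNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(16, 32).to("cuda:0")   # first half on GPU 0
        self.part2 = nn.Linear(32, 8).to("cuda:1")    # second half on GPU 1

    def forward(self, x):
        x = torch.relu(self.part1(x.to("cuda:0")))    # compute on GPU 0
        return self.part2(x.to("cuda:1"))             # move activation to GPU 1

model = ShardedNet()
out = model(torch.randn(4, 16))
print(out.shape)  # torch.Size([4, 8])
```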

First of all, thanks for your attention.
It isn’t exactly what I need, but it helped me with another issue, so thanks again.
My issue is related to edge computing. Basically, I need to run just a couple of layers on a drone, and the remaining layers will run on a machine equipped with a GPU.
So I thought I could send messages from the drone to my machine.
Is this possible with PyTorch?

That’s a really interesting use case, but I’m not sure how well this would work.
You could most likely connect the drone and your workstation to the same network and indeed use DDP.
However, have you thought about the latency this would create?
How long can you wait for the response?

Yes! You are right.
Actually, I am interested in measuring this kind of problem, because my research is concerned with 5G systems.

I think what you may be looking for is our Distributed RPC framework (https://pytorch.org/tutorials/intermediate/rpc_tutorial.html?highlight=rpc), which allows you to send messages and tensors between workers. Also see the Distributed Autograd Framework (https://pytorch.org/docs/master/rpc.html#distributed-autograd-framework) for training models that are partitioned across machines. Lastly, here is an example of training an RNN using RPC/Distributed Autograd: https://github.com/pytorch/examples/tree/master/distributed/rpc/rnn.
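To make the drone/GPU split concrete, a stripped-down sketch on top of the RPC API could look like the code below. The worker names, layer sizes, and split point are illustrative assumptions, not code from the tutorial; both machines would run the same script, with MASTER_ADDR and MASTER_PORT pointing at a rendezvous address reachable from both.

```python
# Illustrative RPC sketch of the drone / GPU-server split (names, shapes,
# and the split point are assumptions, not the tutorial's code).
import os
import torch
import torch.nn as nn
import torch.distributed.rpc as rpc

# Tail of the model, owned by the GPU server process.
server_tail = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

def run_tail(activations):
    """Runs on the GPU machine: finish the forward pass on the received tensor."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    return server_tail.to(device)(activations.to(device)).cpu()

def main():
    rank = int(os.environ["RANK"])   # 0 = drone, 1 = GPU server
    # MASTER_ADDR / MASTER_PORT must be set on both machines.
    rpc.init_rpc(name="drone" if rank == 0 else "server", rank=rank, world_size=2)
    if rank == 0:
        head = nn.Sequential(nn.Linear(16, 32), nn.ReLU())  # layers kept on the drone
        with torch.no_grad():                                # inference-only sketch
            activations = head(torch.randn(1, 16))
            # Ship the intermediate tensor to the server and get the logits back.
            logits = rpc.rpc_sync("server", run_tail, args=(activations,))
        print(logits.shape)
    rpc.shutdown()   # blocks until both workers are done

if __name__ == "__main__":
    main()
```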

Thanks @osalpekar! It helped me a lot.
Actually, I want to thank you both for your attention.
