Multi-machine inference with PyTorch

Hi, I’m new to distributed computation in PyTorch.
I’m interested in partitioning a network so that one piece runs on machine A and the other piece runs on machine B. The first thing I need to do is send tensors from machine A to machine B.
So I thought about using point-to-point communication as in Writing Distributed Applications with PyTorch. I’m trying to adapt that code to send messages between machines A and B, but I haven’t succeeded so far. Can anyone explain the whole pipeline for this?
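For concreteness, the kind of two-machine exchange I have in mind is roughly the sketch below, adapted from the tutorial’s send/recv pattern. The gloo backend and the MASTER_ADDR/MASTER_PORT setup are my assumptions; rank 0 would run on machine A and rank 1 on machine B.

```python
# Minimal point-to-point sketch (gloo backend over TCP) -- my assumptions,
# not working code yet. Run with RANK=0 on machine A and RANK=1 on machine B;
# MASTER_ADDR / MASTER_PORT must point to an address both machines can reach.
import os
import torch
import torch.distributed as dist

def main():
    rank = int(os.environ["RANK"])   # 0 on machine A, 1 on machine B
    dist.init_process_group(
        backend="gloo",              # CPU tensors over TCP
        init_method="env://",        # reads MASTER_ADDR / MASTER_PORT
        rank=rank,
        world_size=2,
    )
    if rank == 0:
        t = torch.randn(4)               # tensor produced on machine A
        dist.send(tensor=t, dst=1)       # blocking send to machine B
        print("A sent:", t)
    else:
        t = torch.zeros(4)               # buffer must match sender's shape/dtype
        dist.recv(tensor=t, src=0)       # blocking receive from machine A
        print("B received:", t)
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```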
Any help would be appreciated!

If you want to use model sharding, this simple example might be useful.
The linked tutorial explains a distributed setup, so let me know if I misunderstood your use case.
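In spirit, model sharding just means placing different parts of the model on different devices and moving the activations between them, roughly like the sketch below. The layer sizes and the assumption of two visible GPUs (cuda:0 and cuda:1) are mine, not the linked example’s.

```python
# Rough model-sharding sketch across two devices on a single machine
# (assumes two visible GPUs; illustrative sizes only).
import torch
import torch.nn as nn

class ShardedNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(16, 32).to("cuda:0")   # first half on GPU 0
        self.part2 = nn.Linear(32, 8).to("cuda:1")    # second half on GPU 1

    def forward(self, x):
        x = torch.relu(self.part1(x.to("cuda:0")))    # compute on GPU 0
        return self.part2(x.to("cuda:1"))             # move activation to GPU 1

model = ShardedNet()
out = model(torch.randn(4, 16))
print(out.shape)  # torch.Size([4, 8])
```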

First of all, thanks for your attention.
It isn’t exactly what I need, but it helped me with another issue, so thanks again.
My issue is related to edge computing. Basically, I need to run just a couple of layers on a drone, and the remaining layers will run on a machine equipped with a GPU.
So I thought I could send messages from the drone to my machine.
Is this possible with PyTorch?

That’s a really interesting use case, but I’m not sure how well this would work.
You could most likely connect the drone and your workstation to the same network and indeed use DDP.
However, have you thought about the latency this would create?
How long can you wait for the response?

Yes! You are right.
Actually, I am interested in measuring this kind of problem, because my research is concerned with 5G systems.

I think what you may be looking for is our Distributed RPC framework (https://pytorch.org/tutorials/intermediate/rpc_tutorial.html?highlight=rpc), which allows you to send messages and tensors between workers. Also see the Distributed Autograd Framework (https://pytorch.org/docs/master/rpc.html#distributed-autograd-framework) for training models that are partitioned across machines. Lastly, here is an example of training an RNN using RPC/Distributed Autograd: https://github.com/pytorch/examples/tree/master/distributed/rpc/rnn.
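To make the drone/GPU split concrete, a stripped-down sketch on top of the RPC API could look like the code below. The worker names, layer sizes, and split point are illustrative assumptions, not code from the tutorial; both machines would run the same script, with MASTER_ADDR and MASTER_PORT pointing at a rendezvous address reachable from both.

```python
# Illustrative RPC sketch of the drone / GPU-server split (names, shapes,
# and the split point are assumptions, not the tutorial's code).
import os
import torch
import torch.nn as nn
import torch.distributed.rpc as rpc

# Tail of the model, owned by the GPU server process.
server_tail = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

def run_tail(activations):
    """Runs on the GPU machine: finish the forward pass on the received tensor."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    return server_tail.to(device)(activations.to(device)).cpu()

def main():
    rank = int(os.environ["RANK"])   # 0 = drone, 1 = GPU server
    # MASTER_ADDR / MASTER_PORT must be set on both machines.
    rpc.init_rpc(name="drone" if rank == 0 else "server", rank=rank, world_size=2)
    if rank == 0:
        head = nn.Sequential(nn.Linear(16, 32), nn.ReLU())  # layers kept on the drone
        with torch.no_grad():                                # inference-only sketch
            activations = head(torch.randn(1, 16))
            # Ship the intermediate tensor to the server and get the logits back.
            logits = rpc.rpc_sync("server", run_tail, args=(activations,))
        print(logits.shape)
    rpc.shutdown()   # blocks until both workers are done

if __name__ == "__main__":
    main()
```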

Thanks @osalpekar! It helped me a lot.
Actually, I want to thank you both for your attention.
