Machine A running on GCP (VM) and machine B running locally (laptop)

Hi there,

I was playing around with this tutorial and took the code from here, and got it working fine locally (on a single machine, without GCP).

So my idea was to actually have rank=0 running in the VM on GCP and have rank=1 run on my laptop. This means that both of these workers are on completely different networks.

I am unable to get this setup working; it hangs when I run the worker with rank=1.

This is my setup and the changes I made in as much detail as possible.

  • For the tutorial code, the only change I made was changing port 29500 to port 5000.
  • On my local machine (laptop), I added a new entry to the /etc/hosts file using the IP address of the NIC, which for me is wlp2s0. Let’s assume the address is 11.22.33.44, so the new entry is 11.22.33.44 mycomputer.
  • The /etc/hosts file on my GCP VM has a corresponding entry, but that one is set by default. The NIC for the VM seems to be ens4. Let’s assume its IP address is 44.33.22.11, and let’s also assume the IP address of the VM is 33.33.33.33. The entry in the /etc/hosts file would then be 44.33.22.11 gcp-vm.
  • I also made sure the port on the VM is open and listening: I updated the firewall settings and, to verify this, ran a simple Flask server and queried it by IP, in my case at http://33.33.33.33:5000.
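For reference, here is a minimal sketch of how the rendezvous endpoint could be set on each worker. It assumes the tutorial code picks up the rank-0 address from the MASTER_ADDR and MASTER_PORT environment variables, as PyTorch's default env:// initialization does. The helper name set_rendezvous_env is hypothetical, and the address is the placeholder VM IP from the list above. It is not a confirmed fix.

```python
import os


def set_rendezvous_env(master_addr: str, master_port: int) -> None:
    """Hypothetical helper: point this process at the rank-0 rendezvous endpoint."""
    os.environ["MASTER_ADDR"] = master_addr
    # Environment variables must be strings, so the port is converted.
    os.environ["MASTER_PORT"] = str(master_port)


# On the laptop (rank=1): use an address of the VM that the laptop can reach.
# 33.33.33.33 and 5000 are the placeholder values from this post.
set_rendezvous_env("33.33.33.33", 5000)
print(os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"])
```

If MASTER_ADDR is left at a value such as localhost on the laptop, the rank=1 worker would look for rank 0 on the laptop itself, which could also show up as a hang.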

I’m not surprised it’s hanging. I suspect the way I’m running the parameter server isn’t correct, but I’m unsure. What changes should I make to get this working properly?

Do you want to have a way to let the two machines ping each other and make sure the network connection between them is working?
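One way to check this is a small TCP probe rather than an ICMP ping, since the RPC traffic runs over TCP. This is a rough sketch using only the Python standard library. The function name can_connect is made up for illustration, and the host and port in the usage comment are the placeholder values from the original post.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example usage from the laptop, while something is listening on the VM:
#   can_connect("33.33.33.33", 5000)
```

Running the probe in both directions (laptop to VM, and VM to laptop) is the more informative test, because a laptop behind a home router is often not reachable from outside even when the VM is.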

Hi there @Yanli_Zhao! Thanks for the reply. Yes, basically it seems the two machines are not communicating with each other. I want to figure out how to get these two machines (one running on GCP, the other being my laptop on a different network) to communicate and train the MNIST model using the RPC framework. Or is this something that PyTorch RPC is currently unable to do in this manner?

Did that answer your question @Yanli_Zhao? Or did I miss something? I can give any extra information that is needed.

I think the laptop and the VM need to be in the same network; otherwise the two nodes will not be able to complete the handshake and connect via the RPC framework.