Hi there,
I was playing around with this tutorial, took the code from here, and got it working fine locally (on a single machine, without GCP).
My idea was to run rank=0 in a VM on GCP and rank=1 on my laptop, which means the two workers are on completely different networks.
I am unable to get this setup working: it hangs when I start the worker with rank=1.
This is my setup and the changes I made in as much detail as possible.
- For the tutorial code, the only change I made was changing port 29500 to port 5000.
- On my local machine (laptop), I added a new entry to /etc/hosts using the IP of the NIC. For me the NIC is wlp2s0; let’s assume its address is 11.22.33.44. So the new entry in /etc/hosts is 11.22.33.44 mycomputer.
- On the GCP VM, /etc/hosts has a corresponding entry, but this one is set by default. The NIC for the VM seems to be ens4. Let’s assume the address on that NIC is 44.33.22.11, and let’s also assume the IP address of the VM is 33.33.33.33. So the entry in /etc/hosts is 44.33.22.11 gcp-vm.
- I also made sure the ports of the VM are open and listening. I updated the firewall settings and, to verify this, created a simple flask server on the VM and queried its IP, in my case http://33.33.33.33:5000.
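For reference, the reachability check in the last bullet can also be done without flask. This is only a minimal sketch of a raw TCP check, not the exact code I ran: port_is_open is a hypothetical helper name, and it assumes something is already listening on the port on the other side. The address and port are the placeholder values from above.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds.

    Hypothetical helper for illustration; it only tests one direction
    (from the machine running it to `host`).
    """
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

From the laptop I would call port_is_open("33.33.33.33", 5000) while the server on the VM is running. Note that this only confirms the laptop can reach the VM, not that the VM can reach the laptop.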
I’m not surprised it’s hanging. I think the way I’m running the parameter server isn’t correct, but I’m unsure. What changes should I make to get this working properly?