Error: address family mismatch

A couple of things here. All the error messages so far come from Gloo, whereas I mostly know TensorPipe, so I’m not 100% sure about what follows.

First, I believe the “address family mismatch” error comes from the fact that you specified two different master addresses for your two nodes, localhost and 192.168.60.67. Even though both addresses reach the same physical machine, they correspond to different interfaces on that machine (loopback versus the LAN interface), hence the mismatch. In particular, it’s possible that localhost resolved to the IPv6 address ::1 while 192.168.60.67 is an IPv4 address, and this caused Gloo to detect a mismatch between address families. You should specify 192.168.60.67 as the master address on both nodes.
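If you want to check what each name resolves to on your machine, here is a minimal sketch using only Python’s standard socket module. The helper name address_families is mine, for illustration only, and this is not how Gloo performs its check; it just shows which address families the resolver returns for a given host. Port 60000 is the one from your setup and has no effect on the result.

```python
import socket


def address_families(host, port=60000):
    """Return the names of the address families that `host` resolves to."""
    # getaddrinfo returns one (family, type, proto, canonname, sockaddr)
    # tuple per candidate address for the host.
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return {socket.AddressFamily(family).name for family, *_ in infos}


if __name__ == "__main__":
    # What "localhost" yields depends on the machine's resolver configuration.
    for host in ("localhost", "192.168.60.67"):
        print(host, "->", sorted(address_families(host)))
```

If localhost reports AF_INET6 (alone or alongside AF_INET) while 192.168.60.67 reports only AF_INET, that would be consistent with the explanation above.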

Second, your goal seems to be to have the server listen on port 60000 only and have all connections go through there. This, AFAIK, is not possible with Gloo or TensorPipe today. You are certainly allowed to specify the master port, and it will be honored, but that port is only used for rendezvous, i.e., for processes to “discover” each other. In practice, each process will then start listening on a new arbitrary port, and it will communicate this port to the other processes through that rendezvous. I believe there is no way to influence how that arbitrary port is selected. Hence you will probably need to map the whole range of ports, or find some other way to put the two machines on the same network, or something of that sort.
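To illustrate what listening on an arbitrary port looks like in general, here is a small sketch with plain Python sockets. I’m assuming the usual approach of binding to port 0 and letting the OS pick a free port; I haven’t verified that Gloo does exactly this, and the function name listen_on_os_chosen_port is hypothetical.

```python
import socket


def listen_on_os_chosen_port(host="127.0.0.1"):
    """Open a listening TCP socket on a port picked by the OS."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0 asks the OS for any free port
    sock.listen()
    # getsockname() reveals which port was actually assigned.
    return sock, sock.getsockname()[1]


if __name__ == "__main__":
    sock, port = listen_on_os_chosen_port()
    print("listening on port", port)
    sock.close()
```

The port number is only known after the bind, which is why it has to be shared through the rendezvous and why forwarding a single fixed port is not enough.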

Finally, you seem to be trying to connect a Linux machine to an OSX machine. While this might be possible in principle, and might even work, I don’t think we ever explicitly supported this scenario. I wouldn’t be surprised if somewhere in the code we introduced the assumption that all endpoints are running on the same platform and, possibly, that they are running the exact same binary version of PyTorch.