How to make rpc work with WSL

I’m trying the sample below on multiple machines:

When I tried it on two Linux machines in the same domain, the sample worked fine. But when I try it on two WSL machines, I can’t get them to connect to each other; both machines block waiting on rpc init.

I tried the "WSL 2 TCP NETWORK FORWARDING" workaround from the post below, but the rpc workers still can’t connect to each other:

So I’m wondering whether there is any working sample or setting for WSL. Any suggestions on how to troubleshoot the issue are also highly appreciated, thanks.

We haven’t tested on WSL (I assume you mean Windows Subsystem for Linux), and are not sure what the gaps are, if there are any.

cc @ptrblck do you know who is familiar with WSL?

Not sure if this helps, but I would verify that the following command resolves to the correct interface. If not, set GLOO_SOCKET_IFNAME explicitly.

getent hosts `hostname`
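For anyone who prefers to run the same check from Python, here is a minimal sketch. It only mirrors the hostname lookup above; the helper names are made up for illustration, and the interface name "eth0" is an assumption (check `ip addr` for the real name on your machine).

```python
import ipaddress
import socket


def resolve_own_hostname():
    """Return (hostname, ip) as seen by the local resolver; ip is None on failure."""
    name = socket.gethostname()
    try:
        return name, socket.gethostbyname(name)
    except OSError:
        return name, None


def is_loopback(ip):
    """True if ip is a loopback address such as 127.0.0.1 or 127.0.1.1."""
    return ip is not None and ipaddress.ip_address(ip).is_loopback


name, ip = resolve_own_hostname()
print(f"{name} resolves to {ip}")
if ip is None or is_loopback(ip):
    # "eth0" is an assumed interface name, not something taken from this thread.
    print("Hostname does not resolve to a routable address; "
          "consider exporting GLOO_SOCKET_IFNAME=eth0 before launching.")
```

If the printed address is a loopback address, other machines cannot reach this worker through it, which is the case where setting GLOO_SOCKET_IFNAME is worth trying.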

@mrshenli Thanks for your reply.
Yes, we are using Windows Subsystem for Linux, as Windows does not support distributed yet. I do see an open PR for this (but we don’t know whether rpc will be supported in this PR): https://github.com/pytorch/pytorch/pull/42897
And yes, the hostname resolves correctly. One difference between a Linux machine and a WSL machine that we are aware of is that a WSL machine shares its IP with the host Windows system but needs port forwarding from Windows to WSL. We already tried the solution that adds port forwarding, but it seems rpc still can’t connect.
https://github.com/microsoft/WSL/issues/4150

So the question is: say rank0 is listening on one port, 12560 for example. Is this the only port that will be used on the rank0 machine during rpc initialization and connection?
Also, is there any way to output detailed information from rpc during initialization so we can find more details? Ideally we would be able to see what address/port rank0 is listening on and where the other ranks are connecting to, and whether an address mismatch or something else causes the rpc initialization to hang.
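In the meantime, one way to narrow this down from outside rpc is a plain TCP probe from the other rank toward rank0. The sketch below is not an rpc feature. It assumes the sample uses the env:// style of initialization, where MASTER_ADDR and MASTER_PORT are read from the environment; the defaults are placeholders, and can_connect is a hypothetical helper name.

```python
import os
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Placeholder defaults; 12560 is just the example port from the question.
host = os.environ.get("MASTER_ADDR", "127.0.0.1")
port = int(os.environ.get("MASTER_PORT", "12560"))
print(f"{host}:{port} reachable: {can_connect(host, port)}")
```

Run it on the non-zero rank while rank0 is blocked in rpc init. A False result would suggest the forwarding rule or a firewall is the problem; a True result only shows that this one port is reachable.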

Unfortunately, I’m not familiar with it and don’t know a specific user who might be.
Maybe @maxiluk or @peterjc123 would know more.

Yep, MSFT experts are helping us add Windows support to PT Distributed. The first step focuses on DDP only, but we do plan to cover all features in the distributed package in future releases.

So the question is: say rank0 is listening on one port, 12560 for example. Is this the only port that will be used on the rank0 machine during rpc initialization and connection?

I see. No, that port is only used for rendezvous during initialization. The RPC backend will grab a port for each pair of workers, and those ports are not visible to users.

cc @lcw any suggestions on how RPC can work with port forwarding?

Do we know the port range, so we can run a test?

I’m not super familiar with port forwarding on WSL, but I assume it does some sort of NAT, right? In that case, I suspect it would be enough if only the listening socket(s) on each server had a fixed port that is correctly forwarded (all other sockets that are accepted would then have a random port, but the NAT would be aware of them and would handle them).

Unfortunately, at the moment the port of the listening socket(s) is also picked at random (well, not at random, but we’re letting the kernel give us back an arbitrary available port). We don’t support a way for the user to specify a port, which I think means there’s no real way for you to set up port forwarding before launching the application.
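To illustrate what "letting the kernel give us back an arbitrary available port" means, here is a small sketch with plain sockets. It is not the RPC backend's actual code, and listen_on is a hypothetical helper; it only shows the general mechanism of binding to port 0.

```python
import socket


def listen_on(port=0, host="0.0.0.0"):
    """Open a listening TCP socket; port=0 lets the kernel choose a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, port))
    sock.listen()
    return sock


first = listen_on()
second = listen_on()
# The chosen ports are only known after bind(), via getsockname().
print("kernel-assigned ports:", first.getsockname()[1], second.getsockname()[1])
first.close()
second.close()
```

Because the port number only exists once the socket is bound, a forwarding rule written before launch has nothing fixed to point at. Passing a non-zero port to a helper like this is what a "user-specified port" option would amount to.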
