[c10d] The hostname of the client socket cannot be retrieved. err=-3

I ran into this warning while running a training job with Accelerate in a Kaggle notebook. The same line is printed over and over:

```
[W502 03:41:26.169817470 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:41:26.250752402 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:41:26.771717561 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:41:27.082920912 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:41:27.839936719 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:41:28.904583608 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:41:29.049047254 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:41:30.348198113 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:41:30.418451977 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:41:32.619639008 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:41:35.498220978 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:41:37.301887996 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:41:39.660384882 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:41:40.236040369 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:41:50.091043974 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:41:51.060457756 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:41:58.333075432 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:41:58.633540757 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:42:13.143552326 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:42:25.585215754 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:42:40.818578380 socket.cpp:200
```
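
A note on the error code, offered as a hedged sketch rather than a diagnosis: if `err` is a `getnameinfo`-style return value (an assumption, the log itself does not say so), then on Linux with glibc `-3` corresponds to `EAI_AGAIN`, a temporary name-resolution failure. The snippet below maps such a code to its symbolic name and tries a reverse lookup of the loopback address from Python. The helper names `gai_error_name` and `reverse_lookup` are made up for illustration.

```python
import socket
from typing import Optional


def gai_error_name(code: int) -> str:
    """Map a getaddrinfo/getnameinfo error code to its EAI_* name, if known."""
    # Build the table from the constants this platform's socket module exposes.
    names = {
        getattr(socket, attr): attr
        for attr in dir(socket)
        if attr.startswith("EAI_")
    }
    return names.get(code, "unknown")


def reverse_lookup(ip: str) -> Optional[str]:
    """Return the hostname for ip, or None if the reverse lookup fails."""
    try:
        # NI_NAMEREQD makes the call fail instead of echoing the numeric address.
        host, _service = socket.getnameinfo((ip, 0), socket.NI_NAMEREQD)
        return host
    except socket.gaierror:
        return None


if __name__ == "__main__":
    # Assumption: the logged err value is a getnameinfo return code.
    print("err=-3 maps to:", gai_error_name(-3))
    print("reverse lookup of 127.0.0.1:", reverse_lookup("127.0.0.1"))
```

If the reverse lookup returns `None` inside the notebook, the container probably cannot resolve its own address to a name, which would be consistent with the warning.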

The command to launch accelerate is:

```
!accelerate launch \
    --multi_gpu \
    --num_processes 2 \
    --num_machines 1 \
    --mixed_precision bf16 \
    train_grpo.py
```

The beginning of train_grpo.py is:

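```
import os

# Pin all distributed communication to the local machine.
os.environ["NCCL_SOCKET_IFNAME"] = "127.0.0.1"
os.environ["GLOO_SOCKET_IFNAME"] = "127.0.0.1"
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["HOSTNAME"] = "127.0.0.1"

# Keep a port set by the launcher; otherwise fall back to a fixed one.
if "MASTER_PORT" not in os.environ:
    os.environ["MASTER_PORT"] = "29505"

# Disable peer-to-peer and InfiniBand transports in NCCL.
os.environ["NCCL_P2P_DISABLE"] = "1"
os.environ["NCCL_IB_DISABLE"] = "1"
```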

I tried the following, but none of it made the warning go away:

  1. Confirmed that the network is enabled in the Kaggle notebook.
  2. Replaced the IP address with `localhost` and with `lo`, and also tried removing the environment settings above entirely.
  3. Upgraded torch from 2.7.0 to 2.10.0.
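
The interface-name variant from item 2 can be sketched as follows. `NCCL_SOCKET_IFNAME` and `GLOO_SOCKET_IFNAME` take network interface names such as `lo` rather than IP addresses, so this sketch looks the loopback interface up instead of hard-coding it. The helper `find_loopback_interface` is hypothetical, and the assumption that the loopback interface name starts with `lo` holds on typical Linux systems but is not guaranteed.

```python
import os
import socket
from typing import Optional


def find_loopback_interface() -> Optional[str]:
    """Return the first interface whose name starts with "lo", or None."""
    # socket.if_nameindex() yields (index, name) pairs for local interfaces.
    for _index, name in socket.if_nameindex():
        if name.startswith("lo"):
            return name
    return None


iface = find_loopback_interface()
if iface is not None:
    # These must be set before the distributed process group is initialized.
    os.environ["NCCL_SOCKET_IFNAME"] = iface
    os.environ["GLOO_SOCKET_IFNAME"] = iface

# Do not override values that the launcher has already provided.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29505")
```

Using `setdefault` for the address and port leaves anything exported by `accelerate launch` untouched.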

Thanks in advance for any suggestions.
