I ran into this issue when running training with accelerate in a Kaggle notebook:
```
[W502 03:41:26.169817470 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:41:26.250752402 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:41:26.771717561 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:41:27.082920912 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:41:27.839936719 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:41:28.904583608 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:41:29.049047254 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:41:30.348198113 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:41:30.418451977 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:41:32.619639008 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:41:35.498220978 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:41:37.301887996 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:41:39.660384882 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:41:40.236040369 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:41:50.091043974 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:41:51.060457756 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:41:58.333075432 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:41:58.633540757 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:42:13.143552326 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:42:25.585215754 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W502 03:42:40.818578380 socket.cpp:200
```
The command to launch accelerate is:
```
!accelerate launch \
  --multi_gpu \
  --num_processes 2 \
  --num_machines 1 \
  --mixed_precision bf16 \
  train_grpo.py
```
The beginning of train_grpo.py is:
```
import os

# Point the rendezvous and socket settings at the loopback address
os.environ["NCCL_SOCKET_IFNAME"] = "127.0.0.1"
os.environ["GLOO_SOCKET_IFNAME"] = "127.0.0.1"
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["HOSTNAME"] = "127.0.0.1"
if "MASTER_PORT" not in os.environ:
    os.environ["MASTER_PORT"] = "29505"
# Disable the NCCL peer-to-peer and InfiniBand transports
os.environ["NCCL_P2P_DISABLE"] = "1"
os.environ["NCCL_IB_DISABLE"] = "1"
```
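As far as I understand, `NCCL_SOCKET_IFNAME` and `GLOO_SOCKET_IFNAME` expect a network interface name (such as `lo`) rather than an IP address, so it may help to see which interfaces the notebook actually exposes. This is a minimal sketch using only the standard library; `list_interfaces` is a hypothetical helper name of mine, not part of accelerate or torch.

```python
import socket


def list_interfaces():
    """Return the names of the network interfaces visible to this process."""
    # socket.if_nameindex() yields (index, name) pairs, one per interface.
    return [name for _, name in socket.if_nameindex()]


# Print the candidates that could be used as an interface name.
print(list_interfaces())
```

If the loopback interface shows up under a different name than expected, that name would be the value to try in the two `*_SOCKET_IFNAME` variables.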
I tried the following solutions, but all of them failed:
- Confirmed that internet access is enabled in the Kaggle notebook.
- Replaced the IP address with localhost, and also with lo, and removed the environment settings above.
- Upgraded torch from 2.7.0 to 2.10.0.
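To narrow this down further, a reverse lookup of the loopback address can be run from the same notebook. This is only a sketch under an assumption I have not confirmed: that the `err=-3` in the warning is a `getaddrinfo`-style error code, which on Linux would correspond to `EAI_AGAIN` (temporary failure in name resolution). `reverse_lookup` is a hypothetical helper name.

```python
import socket


def reverse_lookup(ip="127.0.0.1", port=0):
    """Try to resolve ip back to a hostname; return (hostname, error_code)."""
    try:
        # NI_NAMEREQD makes the call raise instead of silently
        # falling back to the numeric address.
        host, _ = socket.getnameinfo((ip, port), socket.NI_NAMEREQD)
        return host, None
    except socket.gaierror as exc:
        # The first element of args is the numeric resolver error code.
        return None, exc.args[0]


host, err = reverse_lookup()
print("hostname:", host, "error code:", err)
```

If this also reports an error code, the warning would point to name resolution inside the container rather than to the training script itself.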
Thanks for any suggestions.