ConnectionResetError: [Errno 104] Connection reset by peer

luoxiaoyu · January 31, 2023, 3:44pm

Hi, guys.
I’m reproducing a paper’s code based on habitat-simulator & pytorch.
It works well in my local environment (Ubuntu 20.04), but when I move the whole environment into nvidia-docker, it always fails with ConnectionResetError: [Errno 104] Connection reset by peer.
Here is the error:

  File "/PointNav-VO/pointnav_vo/run.py", line 346, in <module>
    main()
  File "/PointNav-VO/pointnav_vo/run.py", line 74, in main
    run_exp(**vars(args))
  File "/PointNav-VO/pointnav_vo/run.py", line 340, in run_exp
    trainer.eval()
  File "/PointNav-VO/pointnav_vo/rl/common/base_trainer.py", line 125, in eval
    checkpoint_index=ckpt_id,
  File "/PointNav-VO/pointnav_vo/rl/ppo/ppo_trainer.py", line 530, in _eval_checkpoint
    self.envs = construct_envs(config, get_env_class(config.ENV_NAME))
  File "/PointNav-VO/pointnav_vo/rl/common/env_utils.py", line 97, in construct_envs
    make_env_fn=make_env_fn, env_fn_args=tuple(tuple(zip(configs, env_classes))),
  File "/habitat-lab/habitat/core/vector_env.py", line 140, in __init__
    read_fn() for read_fn in self._connection_read_fns
  File "/habitat-lab/habitat/core/vector_env.py", line 140, in <listcomp>
    read_fn() for read_fn in self._connection_read_fns
  File "/opt/conda/envs/pointnav-vo/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/opt/conda/envs/pointnav-vo/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/opt/conda/envs/pointnav-vo/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Exception ignored in: <function VectorEnv.__del__ at 0x7fec2fdda710>
Traceback (most recent call last):
  File "/habitat-lab/habitat/core/vector_env.py", line 518, in __del__
    self.close()
  File "/habitat-lab/habitat/core/vector_env.py", line 400, in close
    write_fn((CLOSE_COMMAND, None))
  File "/opt/conda/envs/pointnav-vo/lib/python3.7/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/opt/conda/envs/pointnav-vo/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/opt/conda/envs/pointnav-vo/lib/python3.7/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

ptrblck · February 1, 2023, 6:21am

Are you seeing any errors when trying to run the code with num_workers=0?
Maybe the actual data loading fails and the workers just reraise it as an unrelated crash?

luoxiaoyu · February 1, 2023, 10:50am

After setting num_workers=0, it still does not work.
My dataset is made available to the Docker container as a mount. Do you think that could be the problem?

ptrblck · February 1, 2023, 10:56am

What kind of error are you seeing when you are using num_workers=0?

luoxiaoyu · February 1, 2023, 11:01am

Exactly the same as before.

ptrblck · February 1, 2023, 7:33pm

In that case, check where multiprocessing is used.
Using num_workers>0 makes the DataLoader use the multiprocessing module to create workers. With num_workers=0, data loading and processing are performed in the main process, and PyTorch will not try to create other processes unless you or another library create them explicitly.
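
For illustration, the traceback above fails inside habitat's vector_env.py while reading from a multiprocessing connection, so the worker processes there are not DataLoader workers. The sketch below is not habitat's code. It is a minimal standard-library reproduction of the general pattern, assuming Linux and the "fork" start method: when a worker process dies during setup, the parent only sees a pipe error (EOFError, ConnectionResetError or BrokenPipeError, depending on timing), and the real cause is whatever the worker printed before it exited. The names worker, query_worker and fail_during_setup are hypothetical.

```python
import multiprocessing as mp


def worker(conn, fail_during_setup):
    """Hypothetical stand-in for an environment worker process."""
    if fail_during_setup:
        # Simulate a crash while building the environment, e.g. a missing
        # dataset path or no usable GPU inside the container.
        raise RuntimeError("simulated environment setup failure")
    command = conn.recv()
    conn.send(f"handled {command}")
    conn.close()


def query_worker(fail_during_setup):
    """Send one command to a worker; return (reply or error name, exit code)."""
    ctx = mp.get_context("fork")  # assumption: Linux, as in the thread
    parent_conn, child_conn = ctx.Pipe()
    proc = ctx.Process(target=worker, args=(child_conn, fail_during_setup))
    proc.start()
    child_conn.close()  # the parent keeps only its own end of the pipe
    try:
        parent_conn.send("observation_space")
        result = parent_conn.recv()
    except (EOFError, ConnectionResetError, BrokenPipeError) as exc:
        # The parent learns only that the pipe broke, not why.
        result = type(exc).__name__
    finally:
        proc.join()
        parent_conn.close()
    return result, proc.exitcode


if __name__ == "__main__":
    print(query_worker(False))  # healthy worker: reply and exit code 0
    print(query_worker(True))   # crashed worker: pipe error name and exit code 1
```

In the failing case the worker's own traceback is written to stderr before the parent's pipe error appears, so it is worth scrolling up in the container log and checking the worker's exit code before debugging the connection error itself.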