I am using torch.distributed to launch a distributed training task. I am also trying to use num_workers > 1 in the DataLoader to speed up training.
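For context, the loader is built roughly like this (a minimal sketch with a placeholder dataset; the exact batch size, backend, and dataset are not my real code):

import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler

dist.init_process_group(backend="nccl")  # process group set up by the torch.distributed launcher

# train_dataset is a placeholder for my actual Dataset implementation
sampler = DistributedSampler(train_dataset)
dataloader = DataLoader(
    train_dataset,
    batch_size=32,
    sampler=sampler,
    num_workers=4,      # training is fine with num_workers <= 1, crashes with > 1
    pin_memory=True,
)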
However, I get the following "Segmentation Fault" error whenever I use num_workers > 1:
Error: DataLoader worker (pid(s) 17423) exited unexpectedly
Traceback (most recent call last):
File "/miniconda/envs/iris/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 990, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/miniconda/envs/iris/lib/python3.6/multiprocessing/queues.py", line 104, in get
if not self._poll(timeout):
File "/miniconda/envs/iris/lib/python3.6/multiprocessing/connection.py", line 257, in poll
return self._poll(timeout)
File "/miniconda/envs/iris/lib/python3.6/multiprocessing/connection.py", line 414, in _poll
r = wait([self], timeout)
File "/miniconda/envs/iris/lib/python3.6/multiprocessing/connection.py", line 911, in wait
ready = selector.select(timeout)
File "/miniconda/envs/iris/lib/python3.6/selectors.py", line 376, in select
fd_event_list = self._poll.poll(timeout)
File "/miniconda/envs/iris/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 17423) is killed by signal: Segmentation fault.
I tried to analyze the root cause of this error. In my case it occurs in the following scenario:
Whenever I send metrics (using TensorBoard or an equivalent logger) from the training loop, I get the above segmentation fault once training resumes.
I use a precondition to check whether the current process is the master (rank 0) and send the metrics only from that process, as in the sketch below, but once training resumes I still get the segfault. Is there a way to prevent the DataLoader from using the master process for data loading, or is there a different way to address this issue? Please let me know.
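Roughly, the metric-logging precondition looks like this (a minimal sketch; the writer, metric names, and the rank check via torch.distributed.get_rank() are illustrative, not my exact code):

import torch.distributed as dist
from torch.utils.tensorboard import SummaryWriter

def is_master():
    # Treat rank 0 as the master process; non-distributed runs also count as master.
    return (not dist.is_initialized()) or dist.get_rank() == 0

# Only the master process creates a writer; other ranks skip logging entirely.
writer = SummaryWriter(log_dir="runs/exp") if is_master() else None

def log_metrics(step, loss):
    # Precondition: metrics are sent only from the master process.
    if writer is not None:
        writer.add_scalar("train/loss", loss, step)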
The error is also sometimes reported in a different way, as below:
File "dist_pytorch/train_simple.py", line 112, in train
for data in iter(dataloader):
File "/miniconda/envs/iris/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
data = self._next_data()
File "/miniconda/envs/iris/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1182, in _next_data
idx, data = self._get_data()
File "/miniconda/envs/iris/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1148, in _get_data
success, data = self._try_get_data()
File "/miniconda/envs/iris/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 986, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/miniconda/envs/iris/lib/python3.7/multiprocessing/queues.py", line 113, in get
return _ForkingPickler.loads(res)
File "/miniconda/envs/iris/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 282, in rebuild_storage_fd
fd = df.detach()
File "/miniconda/envs/iris/lib/python3.7/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
File "/miniconda/envs/iris/lib/python3.7/multiprocessing/resource_sharer.py", line 87, in get_connection
c = Client(address, authkey=process.current_process().authkey)
File "/miniconda/envs/iris/lib/python3.7/multiprocessing/connection.py", line 498, in Client
answer_challenge(c, authkey)
File "/miniconda/envs/iris/lib/python3.7/multiprocessing/connection.py", line 742, in answer_challenge
message = connection.recv_bytes(256) # reject large message
File "/miniconda/envs/iris/lib/python3.7/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/miniconda/envs/iris/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/miniconda/envs/iris/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer