I am running DDP for my training task, and occasionally I see a Python exception raised:
Traceback (most recent call last):
  File ".../python3.7/multiprocessing/queues.py", line 242, in _feed
    send_bytes(obj)
This always occurs at the end of a for loop (e.g., the training loop or the evaluation loop). Moreover, the exception does not cause the process to terminate.
What is the root cause of this issue? Does it affect my training?
2021-07-02 01:58:40,520 - some log
Traceback (most recent call last):
  File ".../lib/python3.7/multiprocessing/queues.py", line 242, in _feed
    send_bytes(obj)
2021-07-02 01:58:40,796 - another log
As you can see, the process does not terminate.
However, sometimes I see the “full” traceback as follows:
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/queues.py", line 242, in _feed
    send_bytes(obj)
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
This does not cause the process to terminate either.
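For reference, the lower frames of this traceback (send_bytes down to write) can be produced with the standard library alone. This is only a sketch of the mechanism on a POSIX system, not a confirmed explanation of what happens in the DDP job: it assumes the read end of a multiprocessing pipe has already been closed when the write end sends data. The function name send_after_reader_closed is hypothetical. Note also that _feed runs in a background feeder thread of multiprocessing.Queue, which would be consistent with the traceback being printed without the main process terminating.

```python
import errno
import multiprocessing as mp


def send_after_reader_closed():
    """Write to a pipe whose read end is already closed (hypothetical helper).

    Returns the errno of the BrokenPipeError, or None if no error was raised.
    """
    # One-way pipe: `reader` can only receive, `writer` can only send.
    reader, writer = mp.Pipe(duplex=False)
    # Simulate the consumer going away before the producer sends.
    reader.close()
    try:
        writer.send_bytes(b"payload")
    except BrokenPipeError as exc:
        return exc.errno
    finally:
        writer.close()
    return None


if __name__ == "__main__":
    code = send_after_reader_closed()
    print(code == errno.EPIPE)
```

If the same thing is happening here, the question becomes which process closes its end of the queue at the end of the loop while another process still has data to send.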
Thanks for providing the additional information, although I’m not sure what could be causing this traceback. The weird thing is that the traceback doesn’t contain any application code or DDP code indicating where it is coming from.
Is it possible to share a minimal repro script so I can try to reproduce it on my end?