Multiprocessing issue

Hi,

I am running DDP for my training task. Occasionally, a Python exception is raised:

Traceback (most recent call last):
  File ".../python3.7/multiprocessing/queues.py", line 242, in _feed
    send_bytes(obj)

I observe that this always occurs at the end of a for loop (e.g., the training loop or the evaluation loop). Moreover, this exception does not cause the process to terminate.

What is the root cause of this issue? Does it affect my training?

Could you share the complete traceback? It's not clear what the error might be from the traceback you have provided.

It’s literally the complete traceback:

2021-07-02 01:58:40,520 - some log
Traceback (most recent call last):
  File ".../lib/python3.7/multiprocessing/queues.py", line 242, in _feed
    send_bytes(obj)
2021-07-02 01:58:40,796 - another log

As you can see, the process does not terminate.

However, sometimes I see the “full” traceback as follows:

Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/queues.py", line 242, in _feed
    send_bytes(obj)
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

This does not cause the process to terminate either.

Thanks for providing additional information, although I’m not sure what could be the reason for this traceback. The weird thing is that the traceback doesn’t contain any application code or DDP code indicating where this is coming from.

Is it possible to share a minimal repro script so I can try to reproduce it on my end?
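
As a starting point, the innermost frames of the full traceback can be reproduced in isolation. The sketch below is only an illustration, not a diagnosis of this report: it assumes a POSIX system, and the helper name write_after_reader_closed is hypothetical. It closes the read end of a one-way multiprocessing.Pipe and then writes to the other end, which raises BrokenPipeError. In the tracebacks above, the failing call runs inside _feed, the background thread that multiprocessing.Queue uses to flush queued objects, so the error is printed from that thread instead of being raised in the main training loop. Which queue is involved here, and why its reader would already be gone, is not known.

```python
import multiprocessing


def write_after_reader_closed():
    """Return the exception raised when writing to a pipe with no reader.

    Hypothetical helper for illustration only; assumes a POSIX system.
    """
    # One-way pipe: `reader` can only receive, `writer` can only send.
    reader, writer = multiprocessing.Pipe(duplex=False)
    reader.close()  # simulate the consuming side going away first
    try:
        writer.send_bytes(b"payload")
    except OSError as exc:
        # BrokenPipeError is a subclass of OSError.
        return exc
    finally:
        writer.close()
    return None


if __name__ == "__main__":
    print(repr(write_after_reader_closed()))
```

If the real failure follows the same pattern, it would be consistent with the message appearing at the end of a loop, when the consumer of a queue may be shut down before the producer has finished flushing. That remains an assumption until there is a repro.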

Hi @hnt4499 did you find the cause or any solution?

Hi @fermat97 ,

No, I didn’t. Since it doesn’t affect my task, I didn’t bother trying to fix it.

@hnt4499 I got exactly the same error as you. Can someone help with this issue?

Hi @Hao_Fu1 , do you have a repro of the issue?