  File "/app/train.py", line 48, in train
    for dataset in train_data_loader:
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    idx, batch = self._get_batch()
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 610, in _get_batch
    return self.data_queue.get()
  File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 94, in get
    res = self._recv_bytes()
  File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 274, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 92) is killed by signal: Killed.
User session exited
That isn't the backtrace or error message for the initial error; it appears to be the one raised on the other end of the queue after the initial error had already happened.
At any rate, the original message seems to come from a known shortcoming of Python < 3.8, which handles very large objects badly when sending them between processes.
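To make this concrete, here is a minimal sketch (not your code) of that shortcoming, assuming the culprit is the roughly 2 GiB message limit that, as far as I remember, existed in multiprocessing's pipe protocol before Python 3.8. On 3.6/3.7 the sender crashes and the receiver only sees the connection die; on 3.8+ the same script succeeds. Note that it allocates a bit over 2 GiB of memory:

```python
import multiprocessing as mp

def producer(conn):
    # A payload just over 2 GiB of zeros. Before Python 3.8 the pipe protocol
    # packed the message length into a signed 32-bit header, so sending this
    # raises struct.error inside the worker process.
    payload = bytes(2**31 + 16)
    conn.send_bytes(payload)
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = mp.Pipe()
    p = mp.Process(target=producer, args=(child_conn,))
    p.start()
    child_conn.close()  # the parent only reads from its own end
    try:
        data = parent_conn.recv_bytes()
        print("received", len(data), "bytes")  # what you see on Python >= 3.8
    except EOFError:
        # On Python < 3.8 the worker died before sending anything, so all the
        # parent sees is the broken connection, not the original error.
        print("worker failed before sending anything")
    p.join()
```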
Of course, if you are sending things that large, chances are that something is amiss with what your program does for multiprocessing. Is something (e.g. another library) keeping you from using PyTorch's multiprocessing wrapper, torch.multiprocessing?
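For reference, the wrapper is a drop-in replacement for the standard module: torch.multiprocessing registers custom reducers so that a tensor put on a queue is moved into shared memory and only a small handle travels through the pipe, which avoids pushing gigabytes of pickled data between processes. A rough sketch (the function name and tensor size are just for illustration):

```python
import torch
import torch.multiprocessing as mp  # drop-in replacement for multiprocessing

def worker(q):
    # A moderately large CPU tensor (~64 MB; the size is purely illustrative).
    t = torch.randn(4, 2048, 2048)
    # torch.multiprocessing moves the tensor's storage into shared memory,
    # so only a small handle is pickled and sent through the queue.
    q.put(t)

if __name__ == "__main__":
    q = mp.Queue()
    p = mp.Process(target=worker, args=(q,))
    p.start()
    t = q.get()  # the parent maps the same shared-memory storage
    print(t.shape, t.dtype)
    p.join()
```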