PyTorch on Windows: EOFError: Ran out of input when num_workers > 0

Thank you! Batch size is 32 and each sample is a dict containing two torch.FloatTensors of size (1,256,256) each.

The actual error is OverflowError: cannot serialize a bytes object larger than 4 GiB. You have to reduce the size of the input.

Plus, the related Python bug: link
However, according to this issue, it can be solved by using pickle protocol 4, but that cannot be controlled on our side; it's actually a Python bug. As a workaround, we could implement something like this that overrides the default pickle protocol.
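A minimal sketch of such an override, using only the standard library: subclass multiprocessing's ForkingPickler and force protocol 4, which can serialize objects larger than 4 GiB. Note this is an illustration, not the exact patch referenced above; it must run in the main process before any workers are started, and on some Python versions modules bind ForkingPickler at import time, so the patch may not propagate everywhere.

```python
import io
import pickle
import multiprocessing.reduction as reduction

class ForkingPickler4(reduction.ForkingPickler):
    """ForkingPickler variant that always uses pickle protocol 4,
    which supports objects larger than 4 GiB."""

    @classmethod
    def dumps(cls, obj, protocol=4):
        buf = io.BytesIO()
        cls(buf, protocol).dump(obj)
        return buf.getbuffer()

def dump4(obj, file, protocol=4):
    """Replacement for multiprocessing.reduction.dump() forcing protocol 4."""
    ForkingPickler4(file, protocol).dump(obj)

# Patch multiprocessing before any DataLoader workers are created.
reduction.ForkingPickler = ForkingPickler4
reduction.dump = dump4
```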


Thanks for the reply and sorry for the late answer. Unfortunately, neither reducing the batch size nor reducing the input size (by making the images smaller) helps with the problem. How can two 256x256 tensors exceed 4 GB of RAM? I also tried changing the DEFAULT_PROTOCOL variable in torch/serialization.py to 4 (instead of 2), which seems to control the pickle protocol used, but it had no effect.

I solved the issue by changing protocol=None to 4 in Python's multiprocessing/reduction.py.

import copyreg
import io
import pickle

class ForkingPickler(pickle.Pickler):
    '''Pickler subclass used by multiprocessing.'''
    _extra_reducers = {}
    _copyreg_dispatch_table = copyreg.dispatch_table

    def __init__(self, *args):
        super().__init__(*args)
        self.dispatch_table = self._copyreg_dispatch_table.copy()
        self.dispatch_table.update(self._extra_reducers)

    @classmethod
    def register(cls, type, reduce):
        '''Register a reduce function for a type.'''
        cls._extra_reducers[type] = reduce

    @classmethod
    def dumps(cls, obj, protocol=4):  # changed from protocol=None
        buf = io.BytesIO()
        cls(buf, protocol).dump(obj)
        return buf.getbuffer()

    loads = pickle.loads

register = ForkingPickler.register

def dump(obj, file, protocol=4):  # changed from protocol=None
    '''Replacement for pickle.dump() using ForkingPickler.'''
    ForkingPickler(file, protocol).dump(obj)
Thank you for your help!

I got the same problem with data loaded by pickle. Is there any solution? Thanks.

Are the suggestions in this thread not working?
If so, could you post the complete error message please?

Are you seeing this issue only using your data loading or are you seeing similar issues using random data?
Are you using the if-clause protection as explained here?
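For reference, the if-clause protection means guarding the script's entry point so that worker processes, which re-import the main module under the spawn start method (the default on Windows), do not re-execute the training code. A minimal sketch with a hypothetical dataset, matching the sample shapes mentioned earlier in the thread:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class PairDataset(Dataset):
    """Hypothetical dataset: each sample is a dict of two (1, 256, 256) tensors."""

    def __len__(self):
        return 100

    def __getitem__(self, idx):
        return {"image": torch.zeros(1, 256, 256),
                "mask": torch.zeros(1, 256, 256)}

def main():
    loader = DataLoader(PairDataset(), batch_size=32, num_workers=2)
    for batch in loader:
        pass  # training step goes here

if __name__ == "__main__":  # the guard: workers can re-import this module safely
    main()
```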

Yes, I only see this issue with my own data loading. My system is Ubuntu 16.04. Would the if-clause protection work too?

You could add it, but it shouldn’t be a problem. Sorry, I confused your setup with the title of the topic.
Is changing the pickle version or the protocol not working for you?

I tried the method suggested in this post, but it resulted in another error:

_pickle.UnpicklingError: pickle data was truncated

This error might point to a corrupt file.
Could you download it again or recreate it?

OK, I will try, thanks.

I solved the issue by loading the JSON in the class object, not in the main function.
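One interpretation of that fix, sketched below with a hypothetical annotations file: when the parsed JSON is loaded inside the dataset object rather than being built in main() and handed over, the payload that has to be pickled for each worker stays small (only the path travels across processes).

```python
import json
from torch.utils.data import Dataset

class JsonDataset(Dataset):
    """Loads the JSON inside the dataset rather than receiving parsed data
    from main(), so the pickled dataset stays small."""

    def __init__(self, path):
        self.path = path
        self._records = None  # loaded lazily, once per worker process

    def _load(self):
        if self._records is None:
            with open(self.path) as f:
                self._records = json.load(f)
        return self._records

    def __len__(self):
        return len(self._load())

    def __getitem__(self, idx):
        return self._load()[idx]
```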


I’m seeing the same problem. Running Windows, Python 3.7.7, latest PyTorch version. The issue appears only with num_workers > 0. I tried modifying Anaconda3\envs\test\Lib\multiprocessing\reduction.py with def dump(obj, file, protocol=4), but it didn’t help. Is there a different way to force the protocol version, or any other workaround? I’m running the code in https://github.com/seungwonpark/melgan.git.

Hi @mars,

I also hit the problem after setting protocol=4. Could you share the code you used to work around the problem? Thank you!

 Traceback (most recent call last):
   File "<string>", line 1, in <module>
   File "/opt/conda/lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main
     exitcode = _main(fd)
   File "/opt/conda/lib/python3.7/multiprocessing/spawn.py", line 115, in _main
     self = reduction.pickle.load(from_parent)
 _pickle.UnpicklingError: pickle data was truncated
 (the same truncated-pickle traceback is printed by each of the other worker processes)
 Traceback (most recent call last):
   File "/opt/conda/bin/fairseq-train", line 11, in <module>
     load_entry_point('fairseq', 'console_scripts', 'fairseq-train')()
   File "/code/fairseq/fairseq_cli/train.py", line 370, in cli_main
     nprocs=args.distributed_world_size,
   File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
     while not spawn_context.join():
   File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 107, in join
     (error_index, name)
 Exception: process 5 terminated with signal SIGKILL

Hello!

I am running the PyTorch Windows version and encounter the same EOFError, which is triggered by num_workers > 0. It works when num_workers is exactly 0. I tried @MLAI’s solution, changing protocol=4 in the ForkingPickler class, but it doesn’t work in the multiple-workers scenario.

I wonder, @MLAI, have you tried running with multiple workers?

Thanks!

hi,

I was having a similar problem in PyCharm & Spyder, but after rebuilding the virtual environment it is now working in Spyder. I’m just waiting for the save point and then I’ll try PyCharm.

Not sure if this helps, but it may be worth a shot.

cheers,

chaslie

update:

It runs in the Spyder IDE on Windows 10, but in PyCharm 2021.2.1:

  File "C\anaconda3\envs\pycharm\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <class '__main__.dataset3D'>: attribute lookup dataset3D on __main__ failed
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "\anaconda3\envs\pycharm\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "\anaconda3\envs\pycharm\lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input

Strangely enough, it will run if PyCharm is in debug mode…
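For completeness, the PicklingError above (“attribute lookup dataset3D on __main__ failed”) usually means the Dataset class is defined somewhere that spawned workers cannot re-import, e.g. an interactive console cell. A common remedy, sketched with hypothetical names, is to put the class in its own module so pickle records it by an importable path:

```python
# datasets.py (hypothetical module name): the class lives at an importable
# path, so pickle can record it as datasets.Dataset3D and spawned worker
# processes can look it up again.
from torch.utils.data import Dataset

class Dataset3D(Dataset):  # stand-in for the thread's dataset3D
    def __len__(self):
        return 10

    def __getitem__(self, idx):
        return idx
```

The training script then imports it with `from datasets import Dataset3D` and builds the DataLoader inside an `if __name__ == "__main__":` guard.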

Thank you so much, this worked for me.