When trying to use multiprocessing on Windows, I get the following error:
Traceback (most recent call last):
File "train.py", line 159, in <module>
main()
File "train.py", line 111, in main
for i_batch, sample_batched in enumerate(Dataloader_Train):
File "C:\Users\TestUser\Anaconda3\envs\py3_pt3_orig\lib\site-packages\torch\utils\data\dataloader.py", line 451, in __iter__
return _DataLoaderIter(self)
File "C:\Users\TestUser\Anaconda3\envs\py3_pt3_orig\lib\site-packages\torch\utils\data\dataloader.py", line 239, in __init__
w.start()
File "C:\Users\TestUser\Anaconda3\envs\py3_pt3_orig\lib\multiprocessing\process.py", line 105, in start
self._popen = self._Popen(self)
File "C:\Users\TestUser\Anaconda3\envs\py3_pt3_orig\lib\multiprocessing\context.py", line 212, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "C:\Users\TestUser\Anaconda3\envs\py3_pt3_orig\lib\multiprocessing\context.py", line 313, in _Popen
return Popen(process_obj)
File "C:\Users\TestUser\Anaconda3\envs\py3_pt3_orig\lib\multiprocessing\popen_spawn_win32.py", line 66, in __init__
reduction.dump(process_obj, to_child)
File "C:\Users\TestUser\Anaconda3\envs\py3_pt3_orig\lib\multiprocessing\reduction.py", line 59, in dump
ForkingPickler(file, protocol).dump(obj)
OverflowError: cannot serialize a bytes object larger than 4 GiB
(py3_pt3_orig) C:\Users\TestUser\Documents\eliaseulig\DeepDSA>Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\TestUser\Anaconda3\envs\py3_pt3_orig\lib\multiprocessing\spawn.py", line 106, in spawn_main
exitcode = _main(fd)
File "C:\Users\TestUser\Anaconda3\envs\py3_pt3_orig\lib\multiprocessing\spawn.py", line 116, in _main
self = pickle.load(from_parent)
EOFError: Ran out of input
I wrapped the whole code (except the package imports) into a main() function. It works perfectly fine with num_workers=0, but due to heavy data augmentation this slows down training and keeps my GPU from reaching its full potential.
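For reference, the Windows-safe layout being described — everything that starts workers kept inside main(), behind the __main__ guard — can be sketched with plain multiprocessing, which uses the same spawn mechanism as DataLoader workers on Windows (square() is just a stand-in for per-sample work):

```python
import multiprocessing as mp

def square(x):
    # stand-in for the per-sample work a DataLoader worker would do
    return x * x

def main():
    # Windows has no fork, so each worker is spawned by re-importing
    # this module; without the __main__ guard below, that re-import
    # would try to start workers again, recursively.
    with mp.get_context("spawn").Pool(processes=2) as pool:
        print(pool.map(square, [1, 2, 3]))  # -> [1, 4, 9]

if __name__ == "__main__":
    main()
```

The same structure applies with a DataLoader: construct it and iterate it only inside main().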
What is your batch size and how large is each sample?
I would like to reproduce the error as I’m wondering why a single worker seems to be able to load the data while multiple workers crash.
Plus: the related Python bug: link
However, according to this issue, it can be solved by using pickle protocol 4 — but that cannot be controlled on our side; it's actually a Python bug. As a workaround, we could implement something like this that overrides the default pickle protocol.
Thanks for the reply, and sorry for the late answer. Unfortunately, neither reducing the batch size nor reducing the input size (by simply making the images smaller) helps with the problem. How can two 256x256 tensors exceed 4 GiB of RAM? I also tried changing the DEFAULT_PROTOCOL variable in torch/serialization.py to 4 (instead of 2), which seems to control the pickle protocol used, but it had no effect.
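One possible explanation for why batch and image size don't matter: on Windows the DataLoader spawns its workers, and each spawned worker receives a pickled copy of the entire dataset object, not of a single batch. If the dataset's __init__ preloads all data into memory, that pickled copy can exceed 4 GiB no matter how small each sample is. A minimal sketch (hypothetical EagerDataset/LazyDataset classes, sizes shrunk for illustration) comparing the pickled footprint of the two designs:

```python
import pickle

class EagerDataset:
    """Preloads everything in __init__ -- this whole blob gets pickled
    to every spawned worker (the pattern that can exceed 4 GiB)."""
    def __init__(self):
        self.data = bytes(5_000_000)  # stand-in for gigabytes of images

    def __getitem__(self, idx):
        return self.data[idx]

class LazyDataset:
    """Stores only file paths; samples are loaded inside __getitem__,
    so the copy pickled to each worker stays tiny."""
    def __init__(self, paths):
        self.paths = list(paths)

    def __getitem__(self, idx):
        # real code would open and decode self.paths[idx] here
        return self.paths[idx]

eager = len(pickle.dumps(EagerDataset()))
lazy = len(pickle.dumps(LazyDataset(f"img_{i}.png" for i in range(1000))))
print(f"eager: {eager:,} bytes   lazy: {lazy:,} bytes")
```

If the real dataset pickles to more than 4 GiB, moving the loading from __init__ into __getitem__ avoids the overflow entirely, independent of the pickle protocol.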
Are you seeing this issue only with your data loading, or do you see similar issues with random data?
Are you using the if-clause protection as explained here?
You could add it, but it shouldn’t be a problem. Sorry, I confused your setup with the title of the topic.
Is changing the pickle version or the protocol not working for you?
I’m seeing the same problem: Windows, Python 3.7.7, latest PyTorch version. The issue appears only with num_workers > 0. I tried modifying Anaconda3\envs\test\Lib\multiprocessing\reduction.py with def dump(obj, file, protocol=4, but it didn’t help. Is there a different way to force the protocol version, or any other workaround? I’m running the code from https://github.com/seungwonpark/melgan.git.
I also hit the problem after setting protocol=4. Could you share the code you used to work around it? Thank you!
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/opt/conda/lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "/opt/conda/lib/python3.7/multiprocessing/spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
Traceback (most recent call last):
File "/opt/conda/bin/fairseq-train", line 11, in <module>
load_entry_point('fairseq', 'console_scripts', 'fairseq-train')()
File "/code/fairseq/fairseq_cli/train.py", line 370, in cli_main
nprocs=args.distributed_world_size,
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 107, in join
(error_index, name)
Exception: process 5 terminated with signal SIGKILL
I am running the PyTorch Windows version and encounter the same EOFError, which is triggered by num_workers > 0. It works when num_workers is exactly 0. I tried @MLAI’s solution of changing protocol=4 in the ForkingPickler class, but it doesn’t work in the multi-worker scenario.
I wonder, @MLAI, have you tried running with multiple workers?
I was having a similar problem in PyCharm and Spyder, but after rebuilding the virtual environment it is now working in Spyder; I’m just waiting for the save point and then I’ll try PyCharm.
Not sure if this helps, but it may be worth a shot.