DataLoader Multiprocessing error: can't pickle odict_keys objects when num_workers > 0

I'm using Windows 10 64-bit, Python 3.7.3 in a Jupyter Notebook (Anaconda) environment, on an Intel i9-7980XE:

When I try to enumerate over the DataLoader() object with num_workers > 0 like:

    if __name__ == '__main__':
        ...
        DL = DataLoader(data, batch_size=8, shuffle=True, num_workers=8)
        for i in enumerate(DL):
            print(i)

I get the following error (also when using next(iter(DL))):

TypeError: can't pickle odict_keys objects

Full Error:

TypeError                                 Traceback (most recent call last)
<ipython-input-19-ebb16a21aa89> in <module>
  ...
----> 6     for i in enumerate(DL ):
...

C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in __iter__(self)
    817 
    818     def __iter__(self):
--> 819         return _DataLoaderIter(self)
    820 
    821     def __len__(self):

C:\ProgramData\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py in __init__(self, loader)
    558                 #     before it starts, and __del__ tries to join but will get:
    559                 #     AssertionError: can only join a started process.
--> 560                 w.start()
    561                 self.index_queues.append(index_queue)
    562                 self.workers.append(w)

C:\ProgramData\Anaconda3\lib\multiprocessing\process.py in start(self)
    110                'daemonic processes are not allowed to have children'
    111         _cleanup()
--> 112         self._popen = self._Popen(self)
    113         self._sentinel = self._popen.sentinel
    114         # Avoid a refcycle if the target function holds an indirect

C:\ProgramData\Anaconda3\lib\multiprocessing\context.py in _Popen(process_obj)
    221     @staticmethod
    222     def _Popen(process_obj):
--> 223         return _default_context.get_context().Process._Popen(process_obj)
    224 
    225 class DefaultContext(BaseContext):

C:\ProgramData\Anaconda3\lib\multiprocessing\context.py in _Popen(process_obj)
    320         def _Popen(process_obj):
    321             from .popen_spawn_win32 import Popen
--> 322             return Popen(process_obj)
    323 
    324     class SpawnContext(BaseContext):

C:\ProgramData\Anaconda3\lib\multiprocessing\popen_spawn_win32.py in __init__(self, process_obj)
     87             try:
     88                 reduction.dump(prep_data, to_child)
---> 89                 reduction.dump(process_obj, to_child)
     90             finally:
     91                 set_spawning_popen(None)

C:\ProgramData\Anaconda3\lib\multiprocessing\reduction.py in dump(obj, file, protocol)
     58 def dump(obj, file, protocol=None):
     59     '''Replacement for pickle.dump() using ForkingPickler.'''
---> 60     ForkingPickler(file, protocol).dump(obj)
     61 
     62 #

TypeError: can't pickle odict_keys objects

Well, you cannot use multiprocessing in an interactive shell. Please try running it in a normal shell.

I tried running it as a regular .py script from PowerShell, resulting in the same error, just with an additional traceback appended:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input

I eventually solved my problem, and I'll leave the solution here so hopefully someone else will be spared the pain.

It had nothing to do with the Python version or interactive shells; I tried different environments and none of them made it work. The error was related to pickling dictionaries on Windows.

My PyTorch dataset object (torch.utils.data.Dataset) wrapped a classification dataset read from an XML file. The pipeline went from the .xml file to data_dict = {'classname': {'example 1': img_path, ...}, ...} to a PyTorch dataset holding a list of all elements. In the Dataset class I gathered all class names to expose them as an attribute:

self.classes = data_dict.keys()

which caused the error, because data_dict.keys() is only a view referencing the keys held by the class where I use ElementTree to extract the dict out of the .xml file! I could resolve the issue by assigning separate memory:

self.classes = list(data_dict.keys())

Note that dicts and OrderedDicts are not troublesome in general; assigning a dict as an attribute did not cause the error.
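
For anyone wondering where that line actually lives: below is a minimal sketch of a Dataset along the lines described above. The XML layout, the parsing helper, and the names are all hypothetical, not my actual code; the only relevant detail is wrapping data_dict.keys() in list() inside __init__:

    import xml.etree.ElementTree as ET
    from torch.utils.data import Dataset

    class XmlClassificationDataset(Dataset):
        """Sketch only: the XML layout and attribute names are made up."""

        def __init__(self, xml_path):
            data_dict = self._parse_xml(xml_path)
            # self.classes = data_dict.keys()      # view into the dict -> breaks pickling
            self.classes = list(data_dict.keys())  # independent list -> picklable
            self.samples = [
                (img_path, cls)
                for cls, examples in data_dict.items()
                for img_path in examples.values()
            ]

        def _parse_xml(self, xml_path):
            # Builds {classname: {example_name: img_path}} from a made-up XML layout.
            root = ET.parse(xml_path).getroot()
            return {
                cls_node.get('name'): {ex.get('name'): ex.get('path') for ex in cls_node}
                for cls_node in root
            }

        def __len__(self):
            return len(self.samples)

        def __getitem__(self, idx):
            return self.samples[idx]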


Can you explain more clearly? I am new to PyTorch and I ran into the same problem as you. I have no idea where to add the line you mentioned, self.classes = list(data_dict.keys()). Thank you!

My problem was that my attribute was merely a reference to an object that itself pointed into a file, i.e. the values for my classes were read directly from disk. That caused the pickling to fail. In Python, assigning one variable from another often only creates a new reference to the same object, i.e. no new memory is allocated. In my case I had to wrap the dict.keys() view in a list. While both yield exactly the same values, the list() constructor copies them into a new object in RAM.

It might be that you have a similar problem in your pipeline if you read from a CSV, XML, JSON, or similar file. Make sure that at some point your code makes a real copy of whatever values you read in, so that the objects to be pickled do not point into file-backed storage but live entirely in RAM.
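
To see the difference directly, here is a toy example (not the actual data): a dict_keys view fails to pickle, while a list of the same keys round-trips fine. On Windows the DataLoader workers are spawned, which pickles the whole Dataset object, so any attribute like that triggers the error.

    import pickle

    d = {'cat': 0, 'dog': 1}

    try:
        pickle.dumps(d.keys())                 # dict_keys view
    except TypeError as e:
        print(e)                               # e.g. "can't pickle dict_keys objects"

    print(pickle.loads(pickle.dumps(list(d.keys()))))  # ['cat', 'dog']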


Do you have the correct code on GitHub?

Excuse me, I have run into the same problem as you, but because I'm new to Python and PyTorch, I really can't understand how to solve it. Can you help me?

My code on GitHub is https://github.com/WuJie1010/Facial-Expression-Recognition.Pytorch; when I run mainpro_FER I hit the same problem as you. I'd really appreciate it if you could help me, thank you!!!

Your problem could be in your file fer.py, in these lines:

    self.data = h5py.File('./data/data.h5', 'r', driver='core')
    if self.split == 'Training':
        self.train_data = self.data['Training_pixel']
        self.train_labels = self.data['Training_label']

self.train_data points into the .h5 file. Try wrapping it with a list constructor, or whatever data type is appropriate, to create a deep copy instead of a shallow reference:

    self.data = h5py.File('./data/data.h5', 'r', driver='core')
    if self.split == 'Training':
        self.train_data = list(self.data['Training_pixel'])
        self.train_labels = list(self.data['Training_label'])

Apply the same for your other datasets.

OK, thank you very much, I'm going to give it a try. Thank you, thank you!


But it still doesn’t work

Did you solve the problem?

Use Linux instead? If you're using Windows, set num_workers=0.
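
If you go that route, a quick way to keep one script working on both platforms (dataset construction omitted, as in the original snippet):

    import sys
    from torch.utils.data import DataLoader

    # Single-process loading on Windows, worker processes elsewhere.
    num_workers = 0 if sys.platform.startswith('win') else 8
    DL = DataLoader(data, batch_size=8, shuffle=True, num_workers=num_workers)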


In my case, the error said 'h5py objects cannot be pickled'. I think the reason is that Python copies all member variables of the class when using multiprocessing, and an h5py object cannot be copied to a different process.
So setting num_workers=0 avoids this error. My solution is to avoid using an h5py object as a member variable.
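
For example (just a sketch, not the fer.py code; the path and the 'Training_pixel' / 'Training_label' names are placeholders taken from the snippet above), you can store only the file path in __init__ and open the h5py.File lazily inside __getitem__, so each worker opens its own handle after it has been spawned:

    import h5py
    import torch
    from torch.utils.data import Dataset

    class H5Dataset(Dataset):
        def __init__(self, path):
            self.path = path
            self._file = None                  # not opened here -> Dataset stays picklable
            with h5py.File(path, 'r') as f:
                self._length = len(f['Training_label'])

        def __getitem__(self, idx):
            if self._file is None:             # first access inside each worker process
                self._file = h5py.File(self.path, 'r')
            image = torch.from_numpy(self._file['Training_pixel'][idx])
            label = int(self._file['Training_label'][idx])
            return image, label

        def __len__(self):
            return self._length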

By the way, the real reason self.classes = list(data_dict.keys()) is needed may be that the type of data_dict.keys() in Python 3 is <class 'dict_keys'>, which cannot be pickled either.

This worked for me, thanks!