When trying to use multiprocessing on Windows, I get the following error:
Traceback (most recent call last):
File "train.py", line 159, in <module>
main()
File "train.py", line 111, in main
for i_batch, sample_batched in enumerate(Dataloader_Train):
File "C:\Users\TestUser\Anaconda3\envs\py3_pt3_orig\lib\site-packages\torch\utils\data\dataloader.py", line 451, in __iter__
return _DataLoaderIter(self)
File "C:\Users\TestUser\Anaconda3\envs\py3_pt3_orig\lib\site-packages\torch\utils\data\dataloader.py", line 239, in __init__
w.start()
File "C:\Users\TestUser\Anaconda3\envs\py3_pt3_orig\lib\multiprocessing\process.py", line 105, in start
self._popen = self._Popen(self)
File "C:\Users\TestUser\Anaconda3\envs\py3_pt3_orig\lib\multiprocessing\context.py", line 212, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "C:\Users\TestUser\Anaconda3\envs\py3_pt3_orig\lib\multiprocessing\context.py", line 313, in _Popen
return Popen(process_obj)
File "C:\Users\TestUser\Anaconda3\envs\py3_pt3_orig\lib\multiprocessing\popen_spawn_win32.py", line 66, in __init__
reduction.dump(process_obj, to_child)
File "C:\Users\TestUser\Anaconda3\envs\py3_pt3_orig\lib\multiprocessing\reduction.py", line 59, in dump
ForkingPickler(file, protocol).dump(obj)
OverflowError: cannot serialize a bytes object larger than 4 GiB
(py3_pt3_orig) C:\Users\TestUser\Documents\eliaseulig\DeepDSA>Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\TestUser\Anaconda3\envs\py3_pt3_orig\lib\multiprocessing\spawn.py", line 106, in spawn_main
exitcode = _main(fd)
File "C:\Users\TestUser\Anaconda3\envs\py3_pt3_orig\lib\multiprocessing\spawn.py", line 116, in _main
self = pickle.load(from_parent)
EOFError: Ran out of input
I wrapped the whole code (except the package imports) into a main() function. It works perfectly fine with num_workers=0, but due to heavy data augmentation this slows down training and keeps my GPU from reaching its full potential.
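For reference, the Windows-safe layout being described — everything that starts workers kept inside main(), behind the __main__ guard — can be sketched with plain multiprocessing, which uses the same spawn mechanism as DataLoader workers on Windows (square() is just a stand-in for per-sample work):

```python
import multiprocessing as mp

def square(x):
    # stand-in for the per-sample work a DataLoader worker would do
    return x * x

def main():
    # Windows has no fork, so each worker is spawned by re-importing
    # this module; without the __main__ guard below, that re-import
    # would try to start workers again, recursively.
    with mp.get_context("spawn").Pool(processes=2) as pool:
        print(pool.map(square, [1, 2, 3]))  # -> [1, 4, 9]

if __name__ == "__main__":
    main()
```

The same structure applies with a DataLoader: construct it and iterate it only inside main().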
What is your batch size and how large is each sample?
I would like to reproduce the error as I’m wondering why a single worker seems to be able to load the data while multiple workers crash.
Plus: the related Python bug: link
However, according to this issue, it can be solved by using pickle protocol 4 — but that cannot be controlled on our side; it's actually a Python bug. As a workaround, we could implement something like this that overrides the default pickle protocol.
Thanks for the reply, and sorry for the late answer. Unfortunately, neither reducing the batch size nor reducing the input size (by simply making the images smaller) helps with the problem. How can two 256x256 tensors exceed 4 GiB of RAM? I also tried changing the DEFAULT_PROTOCOL variable in torch/serialization.py to 4 (instead of 2), which seems to control the pickle protocol used, but it had no effect.
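One possible explanation for why batch and image size don't matter: on Windows the DataLoader spawns its workers, and each spawned worker receives a pickled copy of the entire dataset object, not of a single batch. If the dataset's __init__ preloads all data into memory, that pickled copy can exceed 4 GiB no matter how small each sample is. A minimal sketch (hypothetical EagerDataset/LazyDataset classes, sizes shrunk for illustration) comparing the pickled footprint of the two designs:

```python
import pickle

class EagerDataset:
    """Preloads everything in __init__ -- this whole blob gets pickled
    to every spawned worker (the pattern that can exceed 4 GiB)."""
    def __init__(self):
        self.data = bytes(5_000_000)  # stand-in for gigabytes of images

    def __getitem__(self, idx):
        return self.data[idx]

class LazyDataset:
    """Stores only file paths; samples are loaded inside __getitem__,
    so the copy pickled to each worker stays tiny."""
    def __init__(self, paths):
        self.paths = list(paths)

    def __getitem__(self, idx):
        # real code would open and decode self.paths[idx] here
        return self.paths[idx]

eager = len(pickle.dumps(EagerDataset()))
lazy = len(pickle.dumps(LazyDataset(f"img_{i}.png" for i in range(1000))))
print(f"eager: {eager:,} bytes   lazy: {lazy:,} bytes")
```

If the real dataset pickles to more than 4 GiB, moving the loading from __init__ into __getitem__ avoids the overflow entirely, independent of the pickle protocol.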
Are you seeing this issue only with your data loading, or do you see similar issues with random data?
Are you using the if-clause protection as explained here?
You could add it, but it shouldn’t be a problem. Sorry, I confused your setup with the title of the topic.
Is changing the pickle version or the protocol not working for you?
I’m seeing the same problem: Windows, Python 3.7.7, latest PyTorch version. The issue appears only with num_workers > 0. I tried modifying Anaconda3\envs\test\Lib\multiprocessing\reduction.py with def dump(obj, file, protocol=4, but it didn’t help. Is there a different way to force the protocol version, or any other workaround? I’m running the code from https://github.com/seungwonpark/melgan.git.
I also hit the problem after setting protocol=4. Could you share the code you used to work around it? Thank you!
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/opt/conda/lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main
exitcode = _main(fd)
File "/opt/conda/lib/python3.7/multiprocessing/spawn.py", line 115, in _main
self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
Traceback (most recent call last):
File "/opt/conda/bin/fairseq-train", line 11, in <module>
load_entry_point('fairseq', 'console_scripts', 'fairseq-train')()
File "/code/fairseq/fairseq_cli/train.py", line 370, in cli_main
nprocs=args.distributed_world_size,
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 107, in join
(error_index, name)
Exception: process 5 terminated with signal SIGKILL
I am running the PyTorch Windows version and encounter the same EOFError, which is triggered by num_workers > 0. It works when num_workers is exactly 0. I tried @MLAI’s solution of changing protocol=4 in the ForkingPickler class, but it doesn’t work in the multi-worker scenario.
I wonder, @MLAI, have you tried running with multiple workers?
I was having a similar problem in PyCharm and Spyder, but after rebuilding the virtual environment it is now working in Spyder; I’m just waiting for the save point and then I’ll try PyCharm.
Not sure if this helps, but it may be worth a shot.