Hi,
I try to get DistributedDataParallel for inference running. However, on my way doing that I stumbled across a problem which is largely independent from it. I run into a MemoryError as soon as I want to use numpy views:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/opt/user_software/miniconda3_envs/tomotwin_pt2/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/mnt/data/twagner/Projects/TomoTwin/results/202208_YenT_step3/mem_simpel.py", line 54, in run
for batch in volume_loader:
File "/opt/user_software/miniconda3_envs/tomotwin_pt2/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 441, in __iter__
return self._get_iterator()
File "/opt/user_software/miniconda3_envs/tomotwin_pt2/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 388, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/opt/user_software/miniconda3_envs/tomotwin_pt2/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1042, in __init__
w.start()
File "/opt/user_software/miniconda3_envs/tomotwin_pt2/lib/python3.10/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/opt/user_software/miniconda3_envs/tomotwin_pt2/lib/python3.10/multiprocessing/context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/opt/user_software/miniconda3_envs/tomotwin_pt2/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
return Popen(process_obj)
File "/opt/user_software/miniconda3_envs/tomotwin_pt2/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/opt/user_software/miniconda3_envs/tomotwin_pt2/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/opt/user_software/miniconda3_envs/tomotwin_pt2/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/opt/user_software/miniconda3_envs/tomotwin_pt2/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
MemoryError
I used the views before when using DataParallel without torch multiprocessing and it was working. Obviously, torch multiprocessing tries to serialize the data and crashes with memory problems.
Here is a code snippet to reproduce the problem:
As anyone aware of an elegant solution?
Best,
Thorsten