Hi,
my code keeps crashing/freezing at the same point in a loop. The crash always occurs at the same point in the code, but not at the same time or the same iteration – which iteration it hits appears to be random.
When running without Fil, the entire session crashes (no stack trace available – I only know from logging that it is approximately always at the same point). When running with Fil, it freezes instead, and the stack trace always shows the same thing: the main process is stuck in the dataloader when using multiple workers:
File "<own_codebase>", line 45, in _do_loss_eval
for idx, inputs in enumerate(self._data_loader):
File "/opt/conda/envs/_/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
data = self._next_data()
File "/opt/conda/envs/_/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1328, in _next_data
idx, data = self._get_data()
File "/opt/conda/envs/_/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1294, in _get_data
success, data = self._try_get_data()
File "/opt/conda/envs/_/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1132, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/opt/conda/envs/_/lib/python3.9/multiprocessing/queues.py", line 113, in get
if not self._poll(timeout):
File "/opt/conda/envs/_/lib/python3.9/multiprocessing/connection.py", line 262, in poll
return self._poll(timeout)
File "/opt/conda/envs/_/lib/python3.9/multiprocessing/connection.py", line 429, in _poll
r = wait([self], timeout)
File "/opt/conda/envs/_/lib/python3.9/multiprocessing/connection.py", line 936, in wait
ready = selector.select(timeout)
File "/opt/conda/envs/_/lib/python3.9/selectors.py", line 416, in select
fd_event_list = self._selector.poll(timeout)
I’ve run it at least once with num_workers=0 to rule out some other bug hiding behind the multiprocessing issue – that run worked fine. I’m currently doing a second run to confirm (it takes a while).
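Next time it freezes with workers enabled, I’m also planning to check from a second shell whether the worker processes are even still alive. A minimal sketch of that check – assuming psutil is available, and with a hypothetical PID standing in for the real training process:

```python
import psutil  # third-party; assuming it is available in the environment

MAIN_PID = 12345  # hypothetical – replace with the PID of the frozen training process

main = psutil.Process(MAIN_PID)
for child in main.children(recursive=True):
    # a dead/zombie dataloader worker here would explain the main process polling forever
    print(child.pid, child.name(), child.status(), child.cpu_percent(interval=0.5))
```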
What could be the cause of this?
I haven’t yet figured out logging in subprocesses. Since the main process is stuck polling – is it simply waiting for a subprocess that has crashed, so that I’d have to track down why that worker crashes? Or is there a simpler lead, some parameter I could try? (I’d like to avoid digging into multiprocessing logs, which look like a total mess even when logging to separate files, especially since I don’t explicitly create the worker processes myself.)
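For the subprocess logging part, the direction I was considering (short of full multiprocessing logging) is a worker_init_fn that enables faulthandler inside every worker, so a worker that dies on a native-level error at least leaves a traceback in a file. Rough sketch – the log path is hypothetical:

```python
import faulthandler

_fault_logs = []  # keep the file objects alive for the lifetime of each worker process

def worker_init_fn(worker_id: int) -> None:
    # runs once inside every dataloader worker process
    f = open(f"/tmp/dataloader_worker_{worker_id}_fault.log", "w")
    _fault_logs.append(f)
    faulthandler.enable(file=f, all_threads=True)
```

This would then be passed as the worker_init_fn parameter when the loader is built (see the sketch after the config below).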
I’m hoping that I’m missing something that someone with more experience would detect quickly. If not, then it’s back into the depths…
My dataloader config (running on 1 GPU):
- dataset: <detectron2.data.common.MapDataset object at 0x555560380340>
- num_workers: 2
- prefetch_factor: 2
- pin_memory: False
- pin_memory_device:
- timeout: 0
- worker_init_fn: None
- multiprocessing_context: None
- batch_size: 1
- drop_last: False
- sampler: <detectron2.data.samplers.distributed_sampler.InferenceSampler object at 0x5555602b49c0>
- batch_sampler: <torch.utils.data.sampler.BatchSampler object at 0x5555601d1b60>
- generator: None
- collate_fn: <function trivial_batch_collator at 0x55555fb3c1f0>
- persistent_workers: False
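In case it helps the discussion, here is a sketch of the parameter variations I could try, reusing the objects listed above in a plain torch DataLoader. The variable names (dataset, batch_sampler, trivial_batch_collator) just stand in for those objects, and the specific values (spawn context, 120 s timeout) are guesses to experiment with, not a known fix:

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,                          # the MapDataset from above
    batch_sampler=batch_sampler,      # the BatchSampler wrapping the InferenceSampler
    collate_fn=trivial_batch_collator,
    num_workers=2,
    multiprocessing_context="spawn",  # avoid fork-related issues in native libraries
    timeout=120,                      # a positive timeout turns a silent hang into a RuntimeError
    persistent_workers=False,
    pin_memory=False,
    worker_init_fn=worker_init_fn,    # the per-worker faulthandler hook sketched above
)
```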
Of course, I don’t want to rule out that the issue lies with the Detectron2 datatypes I’m using for the dataset and sampler params – I just have no idea how likely that is given the stack trace above. At the moment I simply don’t know where to start.
Any advice is much appreciated!