When I run my program on my university HPC cluster with PyTorch 1.0 and CentOS 6.9, I get the following message:
2018-11-30 10:40:09,381 - INFO - Distributed training: False
2018-11-30 10:40:11,842 - INFO - load model from: modelzoo://resnet50
2018-11-30 10:40:16,534 - INFO - load checkpoint from pretrain_stage2.pth
2018-11-30 10:40:18,179 - INFO - Start running, host: rusu5516@hpc220, work_dir: /project/RDS-FEI-cvpr19-RW/mmdetection/work_dirs/faster_rcnn_r50_fpn_1x_full_data_stage2_scratch_2x
2018-11-30 10:40:18,179 - INFO - workflow: [('train', 1)], max: 12 epochs
/var/spool/PBS/mom_priv/jobs/2624847.pbsserver.SC: line 40: 215659 Segmentation fault (core dumped) python tools/train.py configs/faster_rcnn_r50_fpn_1x_ucf24.py --gpus 1
I can successfully run the same program on my local computer with PyTorch 0.4.1 and Ubuntu 16.04. I then added print statements to locate the line in my code that causes the crash, and found that the segmentation fault is raised while the _DataLoaderIter class in PyTorch's dataloader code spawns its worker processes via multiprocessing:
for i in range(self.num_workers):
    print(54.1)
    index_queue = multiprocessing.Queue()
    print(54.2)
    index_queue.cancel_join_thread()
    print(54.3)
    w = multiprocessing.Process(
        target=_worker_loop,
        args=(self.dataset, index_queue,
              self.worker_result_queue, self.done_event,
              self.collate_fn, base_seed + i,
              self.worker_init_fn, i))
    print(54.4)
    w.daemon = True
    print(54.5)
    # NB: Process.start() actually take some time as it needs to
    # start a process and pass the arguments over via a pipe.
    # Therefore, we only add a worker to self.workers list after
    # it started, so that we do not call .join() if program dies
    # before it starts, and __del__ tries to join but will get:
    # AssertionError: can only join a started process.
    w.start()
    print(54.6)
    self.index_queues.append(index_queue)
    print(54.7)
    self.workers.append(w)
The program printed up to 54.5 and never printed 54.6, so I believe the crash happens inside the call to w.start().
Does anyone have an idea of what is happening and how to fix it?
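To narrow this down, one thing I could try is isolating Process.start() from the DataLoader entirely, to see whether plain multiprocessing also segfaults on the cluster. This is just a hypothetical minimal test I sketched, mirroring what _DataLoaderIter does (a Queue plus one daemon worker); I have not run it on the HPC yet:

```python
import multiprocessing

def _echo(q):
    # Trivial worker: prove the child process actually ran.
    q.put("worker ran")

def spawn_one_worker():
    """Start a single daemon worker the same way _DataLoaderIter does."""
    queue = multiprocessing.Queue()
    p = multiprocessing.Process(target=_echo, args=(queue,))
    p.daemon = True
    p.start()          # the call that appears to segfault on the cluster
    msg = queue.get()  # blocks until the worker has produced its message
    p.join()
    return msg

if __name__ == "__main__":
    print(spawn_one_worker())
```

If this minimal script also crashes, the problem would be in the cluster's Python/glibc environment rather than in PyTorch; if it runs fine, the issue is specific to the DataLoader workers (in which case running with num_workers=0 might at least confirm that as a workaround).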