When I run my program on my university's HPC cluster (PyTorch 1.0, CentOS 6.9), I get the following message:
2018-11-30 10:40:09,381 - INFO - Distributed training: False
2018-11-30 10:40:11,842 - INFO - load model from: modelzoo://resnet50
2018-11-30 10:40:16,534 - INFO - load checkpoint from pretrain_stage2.pth
2018-11-30 10:40:18,179 - INFO - Start running, host: rusu5516@hpc220, work_dir: /project/RDS-FEI-cvpr19-RW/mmdetection/work_dirs/faster_rcnn_r50_fpn_1x_full_data_stage2_scratch_2x
2018-11-30 10:40:18,179 - INFO - workflow: [('train', 1)], max: 12 epochs
/var/spool/PBS/mom_priv/jobs/2624847.pbsserver.SC: line 40: 215659 Segmentation fault (core dumped) python tools/train.py configs/faster_rcnn_r50_fpn_1x_ucf24.py --gpus 1
I can successfully run the same program on my local computer (PyTorch 0.4.1, Ubuntu 16.04). I then added print statements to locate the line in my code that causes the crash, and found that the segmentation fault is raised when `_DataLoaderIter` in the DataLoader code starts its worker processes:
    for i in range(self.num_workers):
        print(54.1)
        index_queue = multiprocessing.Queue()
        print(54.2)
        index_queue.cancel_join_thread()
        print(54.3)
        w = multiprocessing.Process(
            target=_worker_loop,
            args=(self.dataset, index_queue, self.worker_result_queue,
                  self.done_event, self.collate_fn, base_seed + i,
                  self.worker_init_fn, i))
        print(54.4)
        w.daemon = True
        print(54.5)
        # NB: Process.start() actually take some time as it needs to
        # start a process and pass the arguments over via a pipe.
        # Therefore, we only add a worker to self.workers list after
        # it started, so that we do not call .join() if program dies
        # before it starts, and __del__ tries to join but will get:
        # AssertionError: can only join a started process.
        w.start()
        print(54.6)
        self.index_queues.append(index_queue)
        print(54.7)
        self.workers.append(w)
The program printed up to 54.5 but never printed 54.6, so I believe it crashes while executing the `w.start()` line.
Does anyone have an idea of what is happening and how to fix it?