When I run my program on my university HPC cluster with PyTorch 1.0 and CentOS 6.9, I get the following message:
2018-11-30 10:40:09,381 - INFO - Distributed training: False
2018-11-30 10:40:11,842 - INFO - load model from: modelzoo://resnet50
2018-11-30 10:40:16,534 - INFO - load checkpoint from pretrain_stage2.pth
2018-11-30 10:40:18,179 - INFO - Start running, host: rusu5516@hpc220, work_dir: /project/RDS-FEI-cvpr19-RW/mmdetection/work_dirs/faster_rcnn_r50_fpn_1x_full_data_stage2_scratch_2x
2018-11-30 10:40:18,179 - INFO - workflow: [('train', 1)], max: 12 epochs
/var/spool/PBS/mom_priv/jobs/2624847.pbsserver.SC: line 40: 215659 Segmentation fault (core dumped) python tools/train.py configs/faster_rcnn_r50_fpn_1x_ucf24.py --gpus 1
I can successfully run the same program on my local computer with PyTorch 0.4.1 and Ubuntu 16.04. I then added print statements to locate the line in my code that causes the crash, and found that the segmentation fault is raised while the _DataLoaderIter class in PyTorch's dataloader code spawns its worker processes via multiprocessing:
for i in range(self.num_workers):
    print(54.1)
    index_queue = multiprocessing.Queue()
    print(54.2)
    index_queue.cancel_join_thread()
    print(54.3)
    w = multiprocessing.Process(
        target=_worker_loop,
        args=(self.dataset, index_queue,
              self.worker_result_queue, self.done_event,
              self.collate_fn, base_seed + i,
              self.worker_init_fn, i))
    print(54.4)
    w.daemon = True
    print(54.5)
    # NB: Process.start() actually take some time as it needs to
    # start a process and pass the arguments over via a pipe.
    # Therefore, we only add a worker to self.workers list after
    # it started, so that we do not call .join() if program dies
    # before it starts, and __del__ tries to join but will get:
    # AssertionError: can only join a started process.
    w.start()
    print(54.6)
    self.index_queues.append(index_queue)
    print(54.7)
    self.workers.append(w)
The program printed up to 54.5 and never printed 54.6, so I believe the crash happens inside the call to w.start().
Does anyone have an idea of what is happening and how to fix it?
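To narrow this down, one thing I could try is isolating Process.start() from the DataLoader entirely, to see whether plain multiprocessing also segfaults on the cluster. This is just a hypothetical minimal test I sketched, mirroring what _DataLoaderIter does (a Queue plus one daemon worker); I have not run it on the HPC yet:

```python
import multiprocessing

def _echo(q):
    # Trivial worker: prove the child process actually ran.
    q.put("worker ran")

def spawn_one_worker():
    """Start a single daemon worker the same way _DataLoaderIter does."""
    queue = multiprocessing.Queue()
    p = multiprocessing.Process(target=_echo, args=(queue,))
    p.daemon = True
    p.start()          # the call that appears to segfault on the cluster
    msg = queue.get()  # blocks until the worker has produced its message
    p.join()
    return msg

if __name__ == "__main__":
    print(spawn_one_worker())
```

If this minimal script also crashes, the problem would be in the cluster's Python/glibc environment rather than in PyTorch; if it runs fine, the issue is specific to the DataLoader workers (in which case running with num_workers=0 might at least confirm that as a workaround).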