Segmentation fault (core dumped) in torch 1.5.1

eja · August 17, 2021, 11:08am

I trained my model successfully a few days ago. However, when I tried to train it again now, I received the error below. I have not changed any code since the last run.

segmentation fault (core dumped)

Then I enabled faulthandler and it showed this:

Fatal Python error: Segmentation fault

Current thread 0x00007f6ac0982740 (most recent call first):
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/comm.py", line 39 in broadcast_coalesced
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/_functions.py", line 21 in forward
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/replicate.py", line 72 in _broadcast_coalesced_reshape
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/replicate.py", line 89 in replicate
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 159 in replicate
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 154 in forward
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550 in __call__
  File "train_2.py", line 79 in run
  File "train_2.py", line 120 in <module>
Segmentation fault (core dumped)
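
For anyone who wants to reproduce a traceback like the one above, here is a minimal sketch using Python's built-in faulthandler module. It assumes the snippet is placed at the very top of the training script; the log file name is a hypothetical choice, not something from the original post.

```python
import faulthandler

# Keep this file object open for the whole run: faulthandler writes to its
# file descriptor when a fatal signal such as SIGSEGV arrives.
trace_log = open("segfault_trace.log", "w")  # hypothetical file name

# Dump the Python traceback of every thread if the interpreter crashes.
faulthandler.enable(file=trace_log, all_threads=True)
```

Running the script with `python -X faulthandler train_2.py` should have a similar effect without editing the code, with the traceback going to stderr instead of a file.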

Line 81 in train.py is

predicts = net(image)

Lines 122-123 are

if __name__ == '__main__':
    run()

I already reinstalled Python 3.6.9 and it still gives me the same error.

This is my DataParallel code:

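import torch
import torch.nn as nn

cuda_available = torch.cuda.is_available()
device_ids = [0, 1]  # indices of the GPUs to use

if cuda_available:
    # Select the primary GPU only when CUDA is actually available.
    torch.cuda.set_device(device_ids[0])
    net = net.cuda()  # net is the model defined earlier in the script
    net = nn.DataParallel(net, device_ids=device_ids)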
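
One thing that may be worth ruling out with a hard-coded `device_ids` list is that every listed GPU is still visible to the process. The sketch below is only an illustration: `usable_device_ids` is a hypothetical helper, and in the training script the count would presumably come from `torch.cuda.device_count()`.

```python
from typing import List, Sequence


def usable_device_ids(requested: Sequence[int], visible_count: int) -> List[int]:
    """Keep only the requested GPU indices that exist on this machine."""
    return [i for i in requested if 0 <= i < visible_count]


# Placeholder values for illustration; in the real script visible_count
# would be torch.cuda.device_count().
print(usable_device_ids([0, 1], visible_count=1))
```

If the filtered list is shorter than the requested one, a GPU has gone missing (for example through a changed `CUDA_VISIBLE_DEVICES` or a driver problem), which would be a plausible reason for a script to fail without any code change.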

ptrblck · August 17, 2021, 7:12pm

Could you update to the latest stable release (1.9.0) or the nightly binary and rerun your script, please?

eja · August 18, 2021, 6:15am

I tried torch 1.9.0 before, but it gave me an NCCL error.
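
To get more detail on an NCCL error, one option is to turn on NCCL's own logging. This is a minimal sketch, assuming the variable is set before the first multi-GPU call (simplest: before importing torch); it only makes NCCL print more information and does not fix anything by itself.

```python
import os

# NCCL reads this environment variable when it initializes, so it must be
# set before the first multi-GPU operation in the process.
os.environ["NCCL_DEBUG"] = "INFO"
```

The same can be done from the shell by prefixing the command, e.g. `NCCL_DEBUG=INFO python train_2.py`.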