"Aborted (core dumped)" and other errors when training

Hello,
I forked this repo and am trying to train the model from it.
I’m running on Ubuntu 18 with an RTX 2080, PyTorch 1.6, and CUDA 10.2.

When I initiate the training command I get this huge error message, and I’ve no idea how to approach it.

/home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torchvision/transforms/transforms.py:752: UserWarning: The use of the transforms.RandomSizedCrop transform is deprecated, please use transforms.RandomResizedCrop instead.
  "please use transforms.RandomResizedCrop instead.")
/home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torchvision/transforms/transforms.py:257: UserWarning: The use of the transforms.Scale transform is deprecated, please use transforms.Resize instead.
  "please use transforms.Resize instead.")
train_deduce_scene_home.py:174: UserWarning: This overload of cuda is deprecated:
	cuda(torch.device device, bool async, *, torch.memory_format memory_format)
Consider using one of the following signatures instead:
	cuda(torch.device device, bool non_blocking, *, torch.memory_format memory_format) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:766.)
  target = target.cuda(async=True)
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [2,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [3,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [4,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [5,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [6,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [7,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [8,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [9,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "train_deduce_scene_home.py", line 296, in <module>
    main()
  File "train_deduce_scene_home.py", line 141, in main
    train(train_loader, model, criterion, optimizer, epoch)
  File "train_deduce_scene_home.py", line 183, in train
    losses.update(loss, input.size(0))
  File "train_deduce_scene_home.py", line 269, in update
    self.avg = self.sum / self.count
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f2e9c4cf1e2 in /home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f2e9c71df92 in /home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f2e9c4bd9cd in /home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0xa4ed59 (0x7f2ed7f86d59 in /home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x2d7b593 (0x7f2eda2b3593 in /home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x3376132 (0x7f2eda8ae132 in /home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #6: torch::autograd::deleteNode(torch::autograd::Node*) + 0x7f (0x7f2eda8ae1df in /home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x3ec959 (0x7f2ee7f01959 in /home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #8: c10::TensorImpl::release_resources() + 0x20 (0x7f2e9c4bd9a0 in /home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #9: <unknown function> + 0x540ae2 (0x7f2ee8055ae2 in /home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x540b86 (0x7f2ee8055b86 in /home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #22: __libc_start_main + 0xe7 (0x7f2eeab2db97 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

As you can see above, I get many unknown functions and, at the end, the core dump. Any ideas on how to approach this?
(torch.cuda.is_available() returns True, and `which gcc` returns /usr/bin/gcc.)

Thank you:)

Hi,

The important lines here are just before the stack trace:

/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.

Basically, the NLL loss was given a target that is not in the range [0, n_classes).
So I guess one of the labels you pass to your criterion does not have a valid value.
Note that running the same code on CPU will give you a more user-friendly error message.
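
If it helps, here is a minimal sketch of a check you could run over the labels before training. The helper name `find_invalid_labels`, the toy label list, and `n_classes = 7` are illustrative assumptions, not taken from the repo; with a torchvision `ImageFolder` you could probably pass `dataset.targets` instead of the toy list.

```python
from collections import Counter


def find_invalid_labels(labels, n_classes):
    """Return a Counter of label values that fall outside [0, n_classes)."""
    return Counter(int(t) for t in labels if not 0 <= int(t) < n_classes)


# Toy labels standing in for the dataset's targets (assumption: the real
# labels are plain integer class indices, as the NLL assertion implies).
labels = [0, 3, 6, 364, 12, 364]
n_classes = 7  # assumed size of the model's final layer

invalid = find_invalid_labels(labels, n_classes)
if invalid:
    # Report how many labels are out of range and which values they take.
    print(f"{sum(invalid.values())} labels outside [0, {n_classes}): {dict(invalid)}")
```

An empty result means every label is a valid class index, so the assertion in the loss kernel should not fire because of the targets.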

Hey,
thank you very much for the answer!
I’m not sure how to run it on the CPU only, but I’ll try to figure it out.
Also, I didn’t change the code, I just forked it and wanted first to see that I can run it, so I wonder why this is the case.

Maybe you are using a different version of the dataset? Or maybe they preprocessed their dataset to remove these bad labels beforehand?

Hey,
you were right. I’ve talked to the owner of that repo and he helped me understand the issue.
The problem was that I downloaded the entire Places365 dataset, which has 365 labels. However, he used only 7 labels. So after removing the unnecessary labels, the code runs!
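
For anyone who hits the same thing, one possible way to do this in code rather than by deleting folders is sketched below. It assumes the data is a list of (path, label) pairs plus a class-name-to-index mapping, as torchvision's `ImageFolder` exposes through `samples` and `class_to_idx`. The function name `keep_classes`, the class names, and the file names are hypothetical, and this is not necessarily how the repo owner prepared the data.

```python
def keep_classes(samples, class_to_idx, wanted):
    """Keep samples whose class is in `wanted`; remap labels to 0..len(wanted)-1."""
    # New contiguous indices, assigned in alphabetical order of class name.
    new_idx = {name: i for i, name in enumerate(sorted(wanted))}
    # Map each old label to its new label; unwanted classes are left out.
    old_to_new = {class_to_idx[name]: new_idx[name] for name in wanted}
    filtered = [
        (path, old_to_new[label])
        for path, label in samples
        if label in old_to_new
    ]
    return filtered, new_idx


# Hypothetical class names and file paths, for illustration only.
class_to_idx = {"airfield": 0, "bedroom": 1, "kitchen": 2, "zen_garden": 3}
samples = [("a.jpg", 0), ("b.jpg", 1), ("c.jpg", 2), ("d.jpg", 3), ("e.jpg", 2)]

filtered, new_idx = keep_classes(samples, class_to_idx, {"bedroom", "kitchen"})
print(filtered)
print(new_idx)
```

The remapping matters as much as the filtering: dropping classes without renumbering would still leave labels larger than the number of model outputs.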

Thank you for the help :)
