Aborted core dumped and other errors when training

Leeor · August 21, 2020, 10:03pm

Hello,
I forked this repo and am trying to train the code from it.
I’m running on Ubuntu 18 with an RTX 2080, PyTorch 1.6, and CUDA 10.2.

When I initiate the training command I get this huge error message, and I’ve no idea how to approach it.

/home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torchvision/transforms/transforms.py:752: UserWarning: The use of the transforms.RandomSizedCrop transform is deprecated, please use transforms.RandomResizedCrop instead.
  "please use transforms.RandomResizedCrop instead.")
/home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torchvision/transforms/transforms.py:257: UserWarning: The use of the transforms.Scale transform is deprecated, please use transforms.Resize instead.
  "please use transforms.Resize instead.")
train_deduce_scene_home.py:174: UserWarning: This overload of cuda is deprecated:
	cuda(torch.device device, bool async, *, torch.memory_format memory_format)
Consider using one of the following signatures instead:
	cuda(torch.device device, bool non_blocking, *, torch.memory_format memory_format) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:766.)
  target = target.cuda(async=True)
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [2,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [3,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [4,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [5,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [6,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [7,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [8,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [9,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "train_deduce_scene_home.py", line 296, in <module>
    main()
  File "train_deduce_scene_home.py", line 141, in main
    train(train_loader, model, criterion, optimizer, epoch)
  File "train_deduce_scene_home.py", line 183, in train
    losses.update(loss, input.size(0))
  File "train_deduce_scene_home.py", line 269, in update
    self.avg = self.sum / self.count
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f2e9c4cf1e2 in /home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f2e9c71df92 in /home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f2e9c4bd9cd in /home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0xa4ed59 (0x7f2ed7f86d59 in /home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x2d7b593 (0x7f2eda2b3593 in /home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x3376132 (0x7f2eda8ae132 in /home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #6: torch::autograd::deleteNode(torch::autograd::Node*) + 0x7f (0x7f2eda8ae1df in /home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x3ec959 (0x7f2ee7f01959 in /home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #8: c10::TensorImpl::release_resources() + 0x20 (0x7f2e9c4bd9a0 in /home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #9: <unknown function> + 0x540ae2 (0x7f2ee8055ae2 in /home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x540b86 (0x7f2ee8055b86 in /home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #22: __libc_start_main + 0xe7 (0x7f2eeab2db97 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

As you can see above, I get many unknown functions and, at the end, the core dump. Any ideas how to go about it?
(torch.cuda.is_available() returns True, and which gcc returns /usr/bin/gcc)

Thank you:)

albanD · August 21, 2020, 10:25pm

Hi,

The important lines here are just before the stack trace:

/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.

Basically, the NLL loss received a target value t that is not in the valid range, i.e. it does not satisfy 0 <= t < n_classes.
So I guess one of the labels you give to your criterion does not have a valid value.
Note that running the same code on CPU will give you a more user-friendly error message.
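
To see what the CPU error looks like, one option is to call the loss on CPU tensors with a deliberately bad label. This is only a minimal sketch, not code from the repo: the 7-class setup, the batch of 4, and the bad label 12 are made up for illustration, and the exact exception type and wording depend on the PyTorch version.

```python
import torch
import torch.nn as nn

n_classes = 7                      # hypothetical number of classes
criterion = nn.CrossEntropyLoss()  # log-softmax followed by NLL loss

# Everything stays on the CPU: there are no .cuda() calls anywhere.
logits = torch.randn(4, n_classes)     # fake model output for a batch of 4
target = torch.tensor([0, 3, 6, 12])   # 12 is outside the valid range 0..6

caught = None
try:
    criterion(logits, target)
except (IndexError, RuntimeError) as err:
    # The exception type differs between PyTorch releases, so catch both.
    caught = err

print(type(caught).__name__, caught)
```

For the real training script, the same idea means removing or guarding the .cuda() calls so the model, inputs, and targets all stay on the CPU. If you prefer to stay on the GPU, running with the environment variable CUDA_LAUNCH_BLOCKING=1 makes kernels run synchronously, so the Python traceback should point at the loss call rather than at a later, unrelated line.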

Leeor · August 22, 2020, 3:19pm

Hey,
thank you very much for the answer!
I’m not sure how to run it on the CPU only, but I’ll try to figure it out.
Also, I didn’t change the code, I just forked it and wanted first to see that I can run it, so I wonder why this is the case.

albanD · August 24, 2020, 4:30pm

You might be using a different version of the dataset? Or they did some preprocessing on their dataset to remove these bad labels beforehand?
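
A quick way to check for bad labels is to scan the targets before training and count any value outside the expected range. The sketch below is plain Python and assumes you can iterate over the integer labels of your dataset; the function name, the example labels, and n_classes=7 are hypothetical.

```python
from collections import Counter


def find_invalid_labels(labels, n_classes):
    """Return a Counter of label values outside the range [0, n_classes)."""
    return Counter(int(t) for t in labels if not 0 <= int(t) < n_classes)


# Hypothetical labels: a 7-class model fed indices from a larger label set.
labels = [0, 3, 6, 12, 364, 12]
bad = find_invalid_labels(labels, n_classes=7)
print(bad)
```

An empty Counter means every label fits the model's output size; anything else lists the offending values and how often they occur.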

Leeor · August 25, 2020, 4:33am

Hey,
you were right. I’ve talked to the owner of that repo and he helped me understand the issue.
The problem was that I downloaded the entire Places365 dataset, which has 365 labels. However, he used only 7 labels. So after removing the unnecessary labels, the code runs!
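
For anyone hitting the same thing, here is a rough sketch of one way to drop the extra classes. It is not the repo owner's actual preprocessing, which isn't shown in this thread: the (path, label) sample format, the file names, and the class IDs are all made up. The key point is that the kept labels also have to be renumbered to 0..K-1 so they match the model's K outputs.

```python
def filter_and_remap(samples, kept_classes):
    """Keep samples whose label is in kept_classes and renumber labels to 0..K-1."""
    new_index = {old: new for new, old in enumerate(kept_classes)}
    return [(path, new_index[label]) for path, label in samples if label in new_index]


# Hypothetical (path, original_label) pairs and hypothetical class IDs to keep.
samples = [("a.jpg", 45), ("b.jpg", 200), ("c.jpg", 121), ("d.jpg", 45)]
kept_classes = [45, 121, 203]

remapped = filter_and_remap(samples, kept_classes)
print(remapped)
```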

Thank you for the help