Hello,
I forked this repo and am trying to train the model with its code.
I’m running on Ubuntu 18 with an RTX 2080, PyTorch 1.6, and CUDA 10.2.
When I run the training command, I get this huge error message, and I have no idea how to approach it:
/home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torchvision/transforms/transforms.py:752: UserWarning: The use of the transforms.RandomSizedCrop transform is deprecated, please use transforms.RandomResizedCrop instead.
"please use transforms.RandomResizedCrop instead.")
/home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torchvision/transforms/transforms.py:257: UserWarning: The use of the transforms.Scale transform is deprecated, please use transforms.Resize instead.
"please use transforms.Resize instead.")
train_deduce_scene_home.py:174: UserWarning: This overload of cuda is deprecated:
cuda(torch.device device, bool async, *, torch.memory_format memory_format)
Consider using one of the following signatures instead:
cuda(torch.device device, bool non_blocking, *, torch.memory_format memory_format) (Triggered internally at /pytorch/torch/csrc/utils/python_arg_parser.cpp:766.)
target = target.cuda(async=True)
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [2,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [3,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [4,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [5,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [6,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [7,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [8,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [9,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
File "train_deduce_scene_home.py", line 296, in <module>
main()
File "train_deduce_scene_home.py", line 141, in main
train(train_loader, model, criterion, optimizer, epoch)
File "train_deduce_scene_home.py", line 183, in train
losses.update(loss, input.size(0))
File "train_deduce_scene_home.py", line 269, in update
self.avg = self.sum / self.count
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f2e9c4cf1e2 in /home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f2e9c71df92 in /home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f2e9c4bd9cd in /home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0xa4ed59 (0x7f2ed7f86d59 in /home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x2d7b593 (0x7f2eda2b3593 in /home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x3376132 (0x7f2eda8ae132 in /home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #6: torch::autograd::deleteNode(torch::autograd::Node*) + 0x7f (0x7f2eda8ae1df in /home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x3ec959 (0x7f2ee7f01959 in /home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #8: c10::TensorImpl::release_resources() + 0x20 (0x7f2e9c4bd9a0 in /home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #9: <unknown function> + 0x540ae2 (0x7f2ee8055ae2 in /home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x540b86 (0x7f2ee8055b86 in /home/leeor/anaconda3/envs/deduce/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #22: __libc_start_main + 0xe7 (0x7f2eeab2db97 in /lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)
As you can see above, I get many unknown functions in the stack trace, and the run ends with a core dump. Any ideas on how to go about debugging this?
(torch.cuda.is_available() returns True, and "which gcc" returns /usr/bin/gcc.)
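One thing that might be worth checking, as a guess from the assertion text rather than anything confirmed: the repeated `t >= 0 && t < n_classes` failure reads like a target label falling outside the range covered by the model's output layer. Below is a minimal sketch of such a check in plain Python. The names `find_out_of_range_labels`, `targets`, and `n_classes` are hypothetical and not part of the repo, and the label values are made up for illustration.

```python
from typing import Iterable, List, Tuple


def find_out_of_range_labels(
    targets: Iterable[int], n_classes: int
) -> List[Tuple[int, int]]:
    """Return (position, label) pairs for labels outside [0, n_classes)."""
    # Mirrors the condition in the log: a valid label t satisfies 0 <= t < n_classes.
    return [
        (i, int(t))
        for i, t in enumerate(targets)
        if not 0 <= int(t) < n_classes
    ]


# Hypothetical batch of labels for a model with 7 output classes.
bad = find_out_of_range_labels([0, 3, 7, 6, -1], n_classes=7)
print(bad)  # [(2, 7), (4, -1)]
```

With the real data, the labels would presumably come from the training loader (for example `target.tolist()` while the tensor is still on the CPU), and `n_classes` would be the size of the model's final layer. Both are assumptions about parts of the script not shown here.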
Thank you:)