Weird CUDA error

shahy · May 7, 2020, 11:14pm

Yes, it was working before, but this error started showing up after CUDA 10.2 was released.
I couldn’t use the GPU with 10.2, so I switched to 10.1 and installed with:
pip install torch==1.5.0+cu101 torchvision==0.6.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
PyTorch and torchvision versions:
torch==1.5.0+cu101 torchvision==0.6.0+cu101

Stack trace with CUDA_LAUNCH_BLOCKING=1:

THCudaCheck FAIL file=/pytorch/aten/src/THCUNN/generic/SpatialClassNLLCriterion.cu line=134 error=700 : an illegal memory access was encountered
Traceback (most recent call last):
  File "main.py", line 40, in <module>
    train_correspondence_block(root_dir)
  File "/home/jovyan/work/correspondence_block.py", line 77, in train_correspondence_block
    loss_id = criterion_id(idmask_pred, idmask)
  File "/opt/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/venv/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 932, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/opt/venv/lib/python3.7/site-packages/torch/nn/functional.py", line 2317, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/opt/venv/lib/python3.7/site-packages/torch/nn/functional.py", line 2117, in nll_loss
    ret = torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at /pytorch/aten/src/THCUNN/generic/SpatialClassNLLCriterion.cu:134
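
For anyone hitting the same trace: a frequent cause of an illegal memory access inside nll_loss2d is a target mask containing class indices outside [0, C-1], where C is the channel dimension of the prediction. Whether that is what happens here is not confirmed by the trace. Below is a minimal sketch of a check to run right before the loss call. The helper name find_invalid_labels and the stand-in tensors are hypothetical, and it assumes idmask_pred has shape (N, C, H, W) and idmask is an integer tensor of shape (N, H, W).

```python
import torch


def find_invalid_labels(target, num_classes, ignore_index=-100):
    """Return the sorted label values in `target` that lie outside [0, num_classes - 1].

    Hypothetical debugging helper, not part of PyTorch.
    """
    # Copy to the CPU first so the check itself does not launch a CUDA kernel.
    labels = torch.unique(target.detach().cpu()).tolist()
    return [v for v in labels if v != ignore_index and not 0 <= v < num_classes]


# Stand-in tensors (assumed shapes): prediction (N, C, H, W), mask (N, H, W).
idmask_pred = torch.randn(2, 4, 8, 8)
idmask = torch.randint(0, 4, (2, 8, 8))
idmask[0, 0, 0] = 7  # deliberately out of range for C = 4

bad = find_invalid_labels(idmask, num_classes=idmask_pred.shape[1])
print(bad)  # prints [7]
```

In the training loop this would go just before criterion_id(idmask_pred, idmask). An empty list means the labels are in range and the cause lies elsewhere. The check has to run before the failing call, because once the CUDA error has been raised, later operations on GPU tensors in the same process usually fail as well.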