MaskRCNN: CUDA error: an illegal memory access was encountered

I am working with the built-in PyTorch Mask R-CNN model. When I try to access the loss value returned by the model, I get the following error:

Traceback (most recent call last):
  File "/home/droneslab/projects/plastics/shri/DRONES_Lab/trainer.py", line 78, in _one_epoch_train
    batch_loss_train.append(total_loss.item())
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I don't get this error on every epoch; it occurs intermittently. This makes it impossible to log metrics reliably, whether through custom logging or TensorBoard.

If you are using an older PyTorch release, could you update to the latest stable or nightly release and check if you are still seeing the issue? If so, could you create a CUDA coredump and check which kernel causes the memory violation?
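In case it helps, here is a minimal sketch of how the training script could be launched with the relevant debugging variables set. CUDA_LAUNCH_BLOCKING is the variable named in the error message, and CUDA_ENABLE_COREDUMP_ON_EXCEPTION and CUDA_COREDUMP_FILE are the CUDA driver variables for GPU coredumps. The helper names (build_debug_env, run_with_cuda_debug) and the dump file name are my own placeholders, not anything from your code.

```python
import os
import subprocess
import sys


def build_debug_env(coredump_path="maskrcnn_illegal_access.nvcudmp",
                    blocking=True, base_env=None):
    """Return a copy of the environment with CUDA debugging variables set.

    coredump_path is a hypothetical file name; pick any writable location.
    """
    env = dict(os.environ if base_env is None else base_env)
    if blocking:
        # Launch kernels synchronously so the Python traceback points at the
        # call that actually failed instead of a later .item() or sync point.
        env["CUDA_LAUNCH_BLOCKING"] = "1"
    # Ask the CUDA driver to write a GPU coredump when a kernel faults.
    env["CUDA_ENABLE_COREDUMP_ON_EXCEPTION"] = "1"
    env["CUDA_COREDUMP_FILE"] = coredump_path
    return env


def run_with_cuda_debug(script, *args):
    """Run a training script in a child process with the debug environment."""
    cmd = [sys.executable, script, *args]
    return subprocess.run(cmd, env=build_debug_env()).returncode


# Hypothetical usage, assuming the entry point is trainer.py:
# run_with_cuda_debug("trainer.py")
```

The variables are set on a child process because they need to be in place before CUDA is initialized. Since the failure is intermittent, expect the run to be noticeably slower with blocking launches enabled. Once a dump has been written, it should be possible to open it in cuda-gdb with `target cudacore <file>` to see the faulting kernel.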