Mask R-CNN predicting no bounding boxes during training

I am training a Mask R-CNN on biomedical images to detect a trace. However, for some batches of images during training, the model does not output any bounding box proposals, which causes the loss to become NaN. Since I am training on a GPU, training stops partway through with the error below,
" RuntimeError: CUDA error: an illegal memory access was encountered

CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect."

When I traced the error back, I could see that the loss values are NaN, which in turn is because no bounding boxes are predicted by the Mask R-CNN.

It would be great if someone could help me understand this error better and how to fix it.

Could you rerun the code with CUDA_LAUNCH_BLOCKING=1 and check the stacktrace to see which operation fails?
I’m currently unsure whether the illegal memory access is caused by invalid values (e.g. NaNs) being passed to a method that does not check its input tensors for these values.
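In case it helps, a minimal sketch of how to set it (the train.py entry point is taken from your stacktrace; any way of exporting the variable before CUDA is initialized works):

# Option 1: set it on the command line
#   CUDA_LAUNCH_BLOCKING=1 python train.py

# Option 2: set it in Python before torch/CUDA is initialized
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
import torch  # any CUDA work must happen after the variable is set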

I re-ran the code with CUDA_LAUNCH_BLOCKING=1 and get the stacktrace below.

Traceback (most recent call last):
File “train.py”, line 27, in main
return train(config)
File “/home/haicu/harshavardhan.subramanian/AIS_DL/AIS_DL/src/ml_pipeline_template/training_pipeline.py”, line 87, in train
trainer.fit(model=model, datamodule=datamodule)
File “/home/haicu/harshavardhan.subramanian/miniconda3/envs/ml_template_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 741, in fit
self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
File “/home/haicu/harshavardhan.subramanian/miniconda3/envs/ml_template_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 685, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File “/home/haicu/harshavardhan.subramanian/miniconda3/envs/ml_template_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 777, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File “/home/haicu/harshavardhan.subramanian/miniconda3/envs/ml_template_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 1199, in _run
self._dispatch()
File “/home/haicu/harshavardhan.subramanian/miniconda3/envs/ml_template_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 1279, in _dispatch
self.training_type_plugin.start_training(self)
File “/home/haicu/harshavardhan.subramanian/miniconda3/envs/ml_template_env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py”, line 202, in start_training
self._results = trainer.run_stage()
File “/home/haicu/harshavardhan.subramanian/miniconda3/envs/ml_template_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 1289, in run_stage
return self._run_train()
File “/home/haicu/harshavardhan.subramanian/miniconda3/envs/ml_template_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 1311, in _run_train
self._run_sanity_check(self.lightning_module)
File “/home/haicu/harshavardhan.subramanian/miniconda3/envs/ml_template_env/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py”, line 1375, in _run_sanity_check
self._evaluation_loop.run()
File “/home/haicu/harshavardhan.subramanian/miniconda3/envs/ml_template_env/lib/python3.7/site-packages/pytorch_lightning/loops/base.py”, line 145, in run
self.advance(*args, **kwargs)
File “/home/haicu/harshavardhan.subramanian/miniconda3/envs/ml_template_env/lib/python3.7/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py”, line 110, in advance
dl_outputs = self.epoch_loop.run(dataloader, dataloader_idx, dl_max_batches, self.num_dataloaders)
File “/home/haicu/harshavardhan.subramanian/miniconda3/envs/ml_template_env/lib/python3.7/site-packages/pytorch_lightning/loops/base.py”, line 145, in run
self.advance(*args, **kwargs)
File “/home/haicu/harshavardhan.subramanian/miniconda3/envs/ml_template_env/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py”, line 122, in advance
output = self._evaluation_step(batch, batch_idx, dataloader_idx)
File “/home/haicu/harshavardhan.subramanian/miniconda3/envs/ml_template_env/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py”, line 217, in _evaluation_step
output = self.trainer.accelerator.validation_step(step_kwargs)
File “/home/haicu/harshavardhan.subramanian/miniconda3/envs/ml_template_env/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py”, line 239, in validation_step
return self.training_type_plugin.validation_step(*step_kwargs.values())
File “/home/haicu/harshavardhan.subramanian/miniconda3/envs/ml_template_env/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py”, line 219, in validation_step
return self.model.validation_step(*args, **kwargs)
File “/home/haicu/harshavardhan.subramanian/AIS_DL/AIS_DL/src/ml_pipeline_template/models/AIS_UNetModule.py”, line 106, in validation_step
loss, loss_mask, loss_box_reg, loss_classifier = self.step(batch)
File “/home/haicu/harshavardhan.subramanian/AIS_DL/AIS_DL/src/ml_pipeline_template/models/AIS_UNetModule.py”, line 55, in step
loss_dict = self.forward(images, targets)
File “/home/haicu/harshavardhan.subramanian/AIS_DL/AIS_DL/src/ml_pipeline_template/models/AIS_UNetModule.py”, line 40, in forward
return self.net(x,y)
File “/home/haicu/harshavardhan.subramanian/miniconda3/envs/ml_template_env/lib/python3.7/site-packages/torch/nn/modules/module.py”, line 1110, in _call_impl
return forward_call(*input, **kwargs)
File “/home/haicu/harshavardhan.subramanian/AIS_DL/AIS_DL/src/ml_pipeline_template/models/components/MRCNN.py”, line 51, in forward
return self.model(x,y)
File “/home/haicu/harshavardhan.subramanian/miniconda3/envs/ml_template_env/lib/python3.7/site-packages/torch/nn/modules/module.py”, line 1110, in _call_impl
return forward_call(*input, **kwargs)
File “/home/haicu/harshavardhan.subramanian/miniconda3/envs/ml_template_env/lib/python3.7/site-packages/torchvision/models/detection/generalized_rcnn.py”, line 99, in forward
detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
File “/home/haicu/harshavardhan.subramanian/miniconda3/envs/ml_template_env/lib/python3.7/site-packages/torch/nn/modules/module.py”, line 1110, in _call_impl
return forward_call(*input, **kwargs)
File “/home/haicu/harshavardhan.subramanian/miniconda3/envs/ml_template_env/lib/python3.7/site-packages/torchvision/models/detection/roi_heads.py”, line 803, in forward
rcnn_loss_mask = maskrcnn_loss(mask_logits, mask_proposals, gt_masks, gt_labels, pos_matched_idxs)
File “/home/haicu/harshavardhan.subramanian/miniconda3/envs/ml_template_env/lib/python3.7/site-packages/torchvision/models/detection/roi_heads.py”, line 113, in maskrcnn_loss
project_masks_on_boxes(m, p, i, discretization_size) for m, p, i in zip(gt_masks, proposals, mask_matched_idxs)
File “/home/haicu/harshavardhan.subramanian/miniconda3/envs/ml_template_env/lib/python3.7/site-packages/torchvision/models/detection/roi_heads.py”, line 113, in <listcomp>
project_masks_on_boxes(m, p, i, discretization_size) for m, p, i in zip(gt_masks, proposals, mask_matched_idxs)
File “/home/haicu/harshavardhan.subramanian/miniconda3/envs/ml_template_env/lib/python3.7/site-packages/torchvision/models/detection/roi_heads.py”, line 95, in project_masks_on_boxes
return roi_align(gt_masks, rois, (M, M), 1.0)[:, 0]
File “/home/haicu/harshavardhan.subramanian/miniconda3/envs/ml_template_env/lib/python3.7/site-packages/torchvision/ops/roi_align.py”, line 62, in roi_align
input, rois, spatial_scale, output_size[0], output_size[1], sampling_ratio, aligned
RuntimeError: CUDA error: an illegal memory access was encountered

It seems the roi_align call is running into the memory violation. Could you check the shapes and values of all input tensors to see if some of them might be invalid, and whether the layer is missing checks for them?
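Something along these lines could work for the check (a minimal sketch; the targets keys follow the standard torchvision detection convention, so adjust it to your pipeline):

import torch

def report(name, t):
    # Print shape/dtype plus NaN/Inf flags for floating point tensors.
    msg = f"{name}: shape={tuple(t.shape)}, dtype={t.dtype}"
    if t.is_floating_point():
        msg += f", nan={torch.isnan(t).any().item()}, inf={torch.isinf(t).any().item()}"
    print(msg)

# e.g. inside your step(), right before loss_dict = self.forward(images, targets):
for i, (img, tgt) in enumerate(zip(images, targets)):
    report(f"image[{i}]", img)
    report(f"boxes[{i}]", tgt["boxes"])  # also worth checking the boxes are non-degenerate (x2 > x1, y2 > y1)
    report(f"masks[{i}]", tgt["masks"])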

I checked the input values and shapes of all the tensors at each step. For certain images in a batch, the rois come out as an empty list, and when the roi_align() method then tries to return torch.ops.torchvision.roi_align(input, rois, spatial_scale, output_size[0], output_size[1], sampling_ratio, aligned) with this empty list of rois, I get the above CUDA error.
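For reference, this is roughly the check I am printing just before that call (variable names follow torchvision's roi_align signature; this is only for debugging, not a fix):

# rois can be a Tensor[K, 5] or a list of Tensor[L, 4] at this point
n_rois = rois.shape[0] if isinstance(rois, torch.Tensor) else sum(len(r) for r in rois)
if n_rois == 0:
    print("roi_align called with 0 rois; input shape:", tuple(input.shape))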

Thanks for the update!
Are you seeing the illegal memory access with code similar to this?

import torch
import torchvision
input = torch.randn(2, 3, 224, 224, device='cuda')
torchvision.ops.roi_align(input, [], output_size=(100, 100))

No, I am not seeing the error when I run this code. What is very weird is that at some steps, even though rois is an empty list, torch.ops.torchvision.roi_align(input, rois, spatial_scale, output_size[0], output_size[1], sampling_ratio, aligned) simply returns an empty result and I do not get any error.
I am printing input and rois at every step, and it is getting hard to debug where exactly the model is failing.
But the error and the stack trace are always the same.
Another interesting thing is that I do not get the error if I use the SGD optimizer, but the loss becomes NaN at the end of each epoch and I get this message in my Slurm output file: “Trainer was signaled to stop but required minimum epochs (200) or minimum steps (None) has not been met. Training will continue…”.
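To avoid printing everything by hand, I am also thinking of enabling PyTorch's anomaly detection so the backward pass points at the operation where the NaN first appears (standard PyTorch, sketched below with my setup assumed):

import torch

# Debugging-only switch: backward errors will point at the forward op that
# produced the NaN/Inf. It slows training down considerably.
torch.autograd.set_detect_anomaly(True)

# ... then run training as usual, e.g.
# trainer.fit(model=model, datamodule=datamodule)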

I guess the “Trainer” output is coming from Lightning, and I don’t know how to interpret it.
Maybe this module is getting some invalid values and is missing a check for them before executing the kernel. I’ll try to rerun it with different invalid inputs and see how roi_align behaves.
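Roughly what I have in mind for those invalid inputs (a sketch against the standalone torchvision.ops.roi_align call; the shapes and values are made up):

import torch
import torchvision

inp = torch.randn(2, 3, 224, 224, device='cuda')

# rois as Tensor[K, 5]: (batch_index, x1, y1, x2, y2)
cases = {
    "nan coords": torch.tensor([[0., float('nan'), 0., 10., 10.]], device='cuda'),
    "out of bounds": torch.tensor([[0., 0., 0., 1e6, 1e6]], device='cuda'),
    "bad batch index": torch.tensor([[5., 0., 0., 10., 10.]], device='cuda'),
    "empty": torch.zeros(0, 5, device='cuda'),
}

for name, rois in cases.items():
    try:
        out = torchvision.ops.roi_align(inp, rois, output_size=(7, 7))
        torch.cuda.synchronize()  # force the kernel to report errors here
        print(name, "->", tuple(out.shape))
    except RuntimeError as e:
        print(name, "-> RuntimeError:", e)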