If if pass the full model into fp16 at training mode the model return a dict of losses. Some losses are NaN values.
Has anyone make MaskRCNN work at FP16?
If if pass the full model into fp16 at training mode the model return a dict of losses. Some losses are NaN values.
Has anyone make MaskRCNN work at FP16?
I would recommend to use automatic mixed precision training via torch.cuda.amp by installing the nightly binaries.
Simply transforming your model to float16
might easily create overflows/underflows and thus might yield NaNs.
The thing is that MaskRCNN return a dict of losses and not just a loss value
In addition looks like MaskRCNN doesn’t supports torch.cuda.amp.autocast:
~/Documents/test/seg/models/archs/mask_rcnn.py in mixed_precision_one_batch(self, i, b)
186 with autocast():
187 self.model.train()
--> 188 loss_dict = self.model(images,targets)
189 if not self.training:
190 self.model.eval()
~/anaconda3/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
556 result = self._slow_forward(*input, **kwargs)
557 else:
--> 558 result = self.forward(*input, **kwargs)
559 for hook in self._forward_hooks.values():
560 hook_result = hook(self, input, result)
~/anaconda3/envs/seg/lib/python3.7/site-packages/torchvision/models/detection/generalized_rcnn.py in forward(self, images, targets)
68 if isinstance(features, torch.Tensor):
69 features = OrderedDict([('0', features)])
---> 70 proposals, proposal_losses = self.rpn(images, features, targets)
71 detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
72 detections = self.transform.postprocess(detections, images.image_sizes, original_image_sizes)
~/anaconda3/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
556 result = self._slow_forward(*input, **kwargs)
557 else:
--> 558 result = self.forward(*input, **kwargs)
559 for hook in self._forward_hooks.values():
560 hook_result = hook(self, input, result)
~/anaconda3/envs/seg/lib/python3.7/site-packages/torchvision/models/detection/rpn.py in forward(self, images, features, targets)
486 proposals = self.box_coder.decode(pred_bbox_deltas.detach(), anchors)
487 proposals = proposals.view(num_images, -1, 4)
--> 488 boxes, scores = self.filter_proposals(proposals, objectness, images.image_sizes, num_anchors_per_level)
489
490 losses = {}
~/anaconda3/envs/seg/lib/python3.7/site-packages/torchvision/models/detection/rpn.py in filter_proposals(self, proposals, objectness, image_shapes, num_anchors_per_level)
408 boxes, scores, lvl = boxes[keep], scores[keep], lvl[keep]
409 # non-maximum suppression, independently done per level
--> 410 keep = box_ops.batched_nms(boxes, scores, lvl, self.nms_thresh)
411 # keep only topk scoring predictions
412 keep = keep[:self.post_nms_top_n()]
~/anaconda3/envs/seg/lib/python3.7/site-packages/torchvision/ops/boxes.py in batched_nms(boxes, scores, idxs, iou_threshold)
73 offsets = idxs.to(boxes) * (max_coordinate + 1)
74 boxes_for_nms = boxes + offsets[:, None]
---> 75 keep = nms(boxes_for_nms, scores, iou_threshold)
76 return keep
77
~/anaconda3/envs/seg/lib/python3.7/site-packages/torchvision/ops/boxes.py in nms(boxes, scores, iou_threshold)
33 by NMS, sorted in decreasing order of scores
34 """
---> 35 return torch.ops.torchvision.nms(boxes, scores, iou_threshold)
36
37
RuntimeError: Unrecognized tensor type ID: Autocast
Since nms
is a custom extension, you might need to disable autocasting for it as described here.
CC @mcarilli to add more information in case I’m missing something.
Okey, let me know in github if you discover what is originating the issue and how to solve it!!!
Did you have time to look at it?