Error While Finetuning FasterRCNN

Hello guys, I am trying to finetune the Faster R-CNN model. The dataset I have has 3 different labels for objects: vehicle, traffic_light, and pedestrian.
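
For reference, the model is set up roughly like the standard torchvision finetuning recipe (a sketch; assuming fasterrcnn_resnet50_fpn and num_classes = 4, i.e. my 3 classes plus background):

import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# start from a model pre-trained on COCO
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

# replace the box predictor head: 3 object classes + 1 background class
num_classes = 4
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)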

import os

import torch
from PIL import Image


class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, root, transforms):
        self.root = root
        self.transforms = transforms
        # load all image files, sorting them to
        # ensure that they are aligned
        self.imgs = list(sorted(os.listdir(self.root)))

    def __getitem__(self, idx):
        # load images
        img_path = os.path.join(self.root, self.imgs[idx])
        img = Image.open(img_path).convert("RGB")
        
        # get_image_data is my helper that returns the COCO-style
        # annotations (bbox, category_id, image_id, area) for this image
        img_data = get_image_data(self.imgs[idx])

        # get bounding box coordinates for each object
        num_objs = len(img_data["category_id"])
        boxes = img_data["bbox"]

        # convert everything into a torch.Tensor
        boxes = torch.as_tensor(boxes, dtype=torch.float32)

        # ISSUE HERE

        labels = torch.tensor(img_data["category_id"], dtype=torch.int64)
        iscrowd = torch.zeros((num_objs,), dtype=torch.int64)
        image_id = torch.tensor(img_data["image_id"])
        area = torch.tensor(img_data["area"], dtype=torch.float32)
        target = {}
        target["boxes"] = boxes
        target["labels"] = labels
        target["image_id"] = image_id
        target["area"] = area
        target["iscrowd"] = iscrowd
        if self.transforms is not None:
            img, target = self.transforms(img, target)

        return img, target

    def __len__(self):
        return len(self.imgs)

Above is the dataset class I am using.
Below is the error trace; please let me know if you can help with it.

File ~/.conda/envs/py/lib/python3.9/site-packages/torchvision/models/detection/roi_heads.py:772, in RoIHeads.forward(self, features, proposals, image_shapes, targets)
    770     if regression_targets is None:
    771         raise ValueError("regression_targets cannot be None")
--> 772     loss_classifier, loss_box_reg = fastrcnn_loss(class_logits, box_regression, labels, regression_targets)
    773     losses = {"loss_classifier": loss_classifier, "loss_box_reg": loss_box_reg}
    774 else:

File ~/.conda/envs/py/lib/python3.9/site-packages/torchvision/models/detection/roi_heads.py:36, in fastrcnn_loss(class_logits, box_regression, labels, regression_targets)
     31 classification_loss = F.cross_entropy(class_logits, labels)
     33 # get indices that correspond to the regression targets for
     34 # the corresponding ground truth labels, to be used with
     35 # advanced indexing
---> 36 sampled_pos_inds_subset = torch.where(labels > 0)[0]
     37 labels_pos = labels[sampled_pos_inds_subset]
     38 N, num_classes = class_logits.shape

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

So I updated iscrowd to:
iscrowd = torch.zeros((boxes.shape[0],), dtype=torch.int64)

But now the error shows up during the first pass through the model:
loss_dict = model(images, targets)

ERROR:

---> 15 loss_dict = model(images, targets)
     16 losses = sum(loss for loss in loss_dict.values())
     17 loss_value = losses.item()

File ~/.conda/envs/py/lib/python3.9/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File ~/.conda/envs/py/lib/python3.9/site-packages/torchvision/models/detection/generalized_rcnn.py:93, in GeneralizedRCNN.forward(self, images, targets)
     90 degenerate_boxes = boxes[:, 2:] <= boxes[:, :2]
     91 if degenerate_boxes.any():
     92     # print the first degenerate box
---> 93     bb_idx = torch.where(degenerate_boxes.any(dim=1))[0][0]
     94     degen_bb: List[float] = boxes[bb_idx].tolist()
     95     torch._assert(
     96         False,
     97         "All bounding boxes should have positive height and width."
     98         f" Found invalid box {degen_bb} for target at index {target_idx}.",
     99     )

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Did you try rerunning the code with CUDA_LAUNCH_BLOCKING=1 as suggested in the error message?
This should give you a stacktrace pointing to the actual failing line of code, which could e.g. raise an indexing error. Alternatively, you could also run the code on the CPU and see if the error message gives you more information.
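
In case it helps with the CPU run, here is a minimal sanity check you could run over the dataset before training (a sketch, assuming the CustomDataset from above and num_classes = 4, i.e. 3 object classes plus background; the root path and the use of transforms=None are placeholders). Two things tend to trigger exactly these asserts: labels outside the range the classification head expects, and degenerate boxes. Also note that torchvision detection models expect boxes in (x1, y1, x2, y2) format, so if get_image_data returns COCO-style (x, y, width, height) boxes, they would need to be converted first:

dataset = CustomDataset(root="path/to/images", transforms=None)  # hypothetical instantiation
num_classes = 4  # assumption: 3 object classes + background

for i in range(len(dataset)):
    _, target = dataset[i]
    labels, boxes = target["labels"], target["boxes"]
    # object labels should lie in [1, num_classes - 1]; values >= num_classes
    # cause the device-side assert seen in fastrcnn_loss
    if len(labels) > 0 and (labels.min() < 1 or labels.max() > num_classes - 1):
        print(f"sample {i}: label out of range: {labels}")
    # boxes with non-positive width/height trip the assert in GeneralizedRCNN.forward
    degenerate = (boxes[:, 2] <= boxes[:, 0]) | (boxes[:, 3] <= boxes[:, 1])
    if degenerate.any():
        print(f"sample {i}: degenerate box(es): {boxes[degenerate]}")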

Hey, thanks for the reply. And yeah, I did try to run on the CPU; the following error showed up. I am trying to work on it. Thanks for the help and patience.

RuntimeError                              Traceback (most recent call last)
Cell In[71], line 31
     28         prog_bar.set_description(desc=f"Loss: {loss_value:.4f}")
     29     return train_loss_list
---> 31 train(train_data_loader, model)

Cell In[71], line 18, in train(train_data_loader, model)
     16 images = list(image.to(DEVICE) for image in images)
     17 targets = [{k: v.to(DEVICE) for k, v in t.items()} for t in targets]
---> 18 loss_dict = model(images, targets)
     19 losses = sum(loss for loss in loss_dict.values())
     20 loss_value = losses.item()

File ~/.conda/envs/py/lib/python3.9/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File ~/.conda/envs/py/lib/python3.9/site-packages/torchvision/models/detection/generalized_rcnn.py:101, in GeneralizedRCNN.forward(self, images, targets)
     94             degen_bb: List[float] = boxes[bb_idx].tolist()
     95             torch._assert(
     96                 False,
     97                 "All bounding boxes should have positive height and width."
     98                 f" Found invalid box {degen_bb} for target at index {target_idx}.",
     99             )
--> 101 features = self.backbone(images.tensors)
    102 if isinstance(features, torch.Tensor):
    103     features = OrderedDict([("0", features)])

File ~/.conda/envs/py/lib/python3.9/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File ~/.conda/envs/py/lib/python3.9/site-packages/torchvision/models/detection/backbone_utils.py:58, in BackboneWithFPN.forward(self, x)
     56 def forward(self, x: Tensor) -> Dict[str, Tensor]:
     57     x = self.body(x)
---> 58     x = self.fpn(x)
     59     return x

File ~/.conda/envs/py/lib/python3.9/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File ~/.conda/envs/py/lib/python3.9/site-packages/torchvision/ops/feature_pyramid_network.py:196, in FeaturePyramidNetwork.forward(self, x)
    194     inner_top_down = F.interpolate(last_inner, size=feat_shape, mode="nearest")
    195     last_inner = inner_lateral + inner_top_down
--> 196     results.insert(0, self.get_result_from_layer_blocks(last_inner, idx))
    198 if self.extra_blocks is not None:
    199     results, names = self.extra_blocks(results, x, names)

File ~/.conda/envs/py/lib/python3.9/site-packages/torchvision/ops/feature_pyramid_network.py:169, in FeaturePyramidNetwork.get_result_from_layer_blocks(self, x, idx)
    167 for i, module in enumerate(self.layer_blocks):
    168     if i == idx:
--> 169         out = module(x)
    170 return out

File ~/.conda/envs/py/lib/python3.9/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File ~/.conda/envs/py/lib/python3.9/site-packages/torch/nn/modules/container.py:204, in Sequential.forward(self, input)
    202 def forward(self, input):
    203     for module in self:
--> 204         input = module(input)
    205     return input

File ~/.conda/envs/py/lib/python3.9/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File ~/.conda/envs/py/lib/python3.9/site-packages/torch/nn/modules/conv.py:463, in Conv2d.forward(self, input)
    462 def forward(self, input: Tensor) -> Tensor:
--> 463     return self._conv_forward(input, self.weight, self.bias)

File ~/.conda/envs/py/lib/python3.9/site-packages/torch/nn/modules/conv.py:459, in Conv2d._conv_forward(self, input, weight, bias)
    455 if self.padding_mode != 'zeros':
    456     return F.conv2d(F.pad(input, self._reversed_padding_repeated_twice, mode=self.padding_mode),
    457                     weight, bias, self.stride,
    458                     _pair(0), self.dilation, self.groups)
--> 459 return F.conv2d(input, weight, bias, self.stride,
    460                 self.padding, self.dilation, self.groups)

File ~/.conda/envs/py/lib/python3.9/site-packages/torch/utils/data/_utils/signal_handling.py:66, in _set_SIGCHLD_handler.<locals>.handler(signum, frame)
     63 def handler(signum, frame):
     64     # This following call uses `waitid` with WNOHANG from C side. Therefore,
     65     # Python can still get and update the process status successfully.
---> 66     _error_if_any_worker_fails()
     67     if previous_handler is not None:
     68         assert callable(previous_handler)

RuntimeError: DataLoader worker (pid 10920) is killed by signal: Killed. 

Try to set num_workers=0 and rerun the code, as the stacktrace is still missing the actual error message.
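
For example, assuming the loader is built roughly like this (the batch size is a placeholder; the collate_fn is needed because detection targets are dicts), only the num_workers argument changes:

train_data_loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=2,  # placeholder value
    shuffle=True,
    num_workers=0,  # load in the main process so the real error surfaces
    collate_fn=lambda batch: tuple(zip(*batch)),
)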

Done. 2 batches trained so far.

I don't know what exactly is going on here.

So now I just started running on the GPU with num_workers=0 and it's running. Can you please enlighten me about what happened exactly?

Thanks a lot though

Unfortunately, I don't know what might be causing the DataLoader issue, as these worker crashes are usually not reproducible. You could try to iterate the DataLoader alone using multiple workers and see if this would also fail, in order to isolate the issue further.
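
A minimal sketch of that isolation step, assuming the same dataset and collate_fn as before (batch size and worker count are placeholders):

# iterate the DataLoader alone, without the model, using multiple workers
loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=2,
    num_workers=4,  # assumption: the worker count that failed before
    collate_fn=lambda batch: tuple(zip(*batch)),
)

n_batches = 0
for images, targets in loader:
    n_batches += 1  # no model involved; a crash here points at the loader/workers
print(f"iterated {n_batches} batches without a worker failure")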

Sure thing, I will try doing that and update here. Hopefully it's not the data being a bit weird and causing that.

Again thank you :slight_smile: