Do we have to count the background as an object class for SSD?

douglasrizzo · June 24, 2021, 10:09am

The documentation for the SSD class says that the background should not be counted as an object class in the num_classes argument passed when instantiating an SSD object.

https://github.com/pytorch/vision/blob/959666891589c35e6d225943253f523cffbae4cc/torchvision/models/detection/ssd.py#L144

              - scores (Tensor[N]): the scores for each detection

          Args:
              backbone (nn.Module): the network used to compute the features for the model.
                  It should contain an out_channels attribute with the list of the output channels of
                  each feature map. The backbone should return a single Tensor or an OrderedDict[Tensor].
              anchor_generator (DefaultBoxGenerator): module that generates the default boxes for a
                  set of feature maps.
              size (Tuple[int, int]): the width and height to which images will be rescaled before feeding them
                  to the backbone.
              num_classes (int): number of output classes of the model (excluding the background).
              image_mean (Tuple[float, float, float]): mean values used for input normalization.
                  They are generally the mean values of the dataset on which the backbone has been trained
                  on
              image_std (Tuple[float, float, float]): std values used for input normalization.
                  They are generally the std values of the dataset on which the backbone has been trained on
              head (nn.Module, optional): Module run on top of the backbone features. Defaults to a module containing
                  a classification and regression module.
              score_thresh (float): Score threshold used for postprocessing the detections.
              nms_thresh (float): NMS threshold used for postprocessing the detections.
              detections_per_img (int): Number of best detections to keep after NMS.

However, further down in the same file, the ssd300_vgg16 function instantiates an SSD object. The docstring of that function explicitly says that the background should be counted as an object class, but the code does not account for this: num_classes is passed to the SSD constructor unchanged (i.e. I did not see num_classes be decremented by one when creating the SSD object).

https://github.com/pytorch/vision/blob/959666891589c35e6d225943253f523cffbae4cc/torchvision/models/detection/ssd.py#L589

          anchor_generator = DefaultBoxGenerator([[2], [2, 3], [2, 3], [2, 3], [2], [2]],
                                                 scales=[0.07, 0.15, 0.33, 0.51, 0.69, 0.87, 1.05],
                                                 steps=[8, 16, 32, 64, 100, 300])

          defaults = {
              # Rescale the input in a way compatible to the backbone
              "image_mean": [0.48235, 0.45882, 0.40784],
              "image_std": [1.0 / 255.0, 1.0 / 255.0, 1.0 / 255.0],  # undo the 0-1 scaling of toTensor
          }
          kwargs = {**defaults, **kwargs}
          model = SSD(backbone, anchor_generator, (300, 300), num_classes, **kwargs)
          if pretrained:
              weights_name = 'ssd300_vgg16_coco'
              if model_urls.get(weights_name, None) is None:
                  raise ValueError("No checkpoint is available for model {}".format(weights_name))
              state_dict = load_state_dict_from_url(model_urls[weights_name], progress=progress)
              model.load_state_dict(state_dict)
          return model

Here is the documentation for this function, which says we should include the background in the number of classes.

https://github.com/pytorch/vision/blob/959666891589c35e6d225943253f523cffbae4cc/torchvision/models/detection/ssd.py#L563

          Example:

              >>> model = torchvision.models.detection.ssd300_vgg16(pretrained=True)
              >>> model.eval()
              >>> x = [torch.rand(3, 300, 300), torch.rand(3, 500, 400)]
              >>> predictions = model(x)

          Args:
              pretrained (bool): If True, returns a model pre-trained on COCO train2017
              progress (bool): If True, displays a progress bar of the download to stderr
              num_classes (int): number of output classes of the model (including the background)
              pretrained_backbone (bool): If True, returns a model with backbone pre-trained on Imagenet
              trainable_backbone_layers (int): number of trainable (not frozen) resnet layers starting from final block.
                  Valid values are between 0 and 5, with 5 meaning all backbone layers are trainable.
          """
          if "size" in kwargs:
              warnings.warn("The size of the model is already fixed; ignoring the argument.")

          trainable_backbone_layers = _validate_trainable_layers(
              pretrained or pretrained_backbone, trainable_backbone_layers, 5, 5)
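
To make the mismatch concrete, here is a small sketch of the arithmetic behind the two docstrings. It is plain Python, not torchvision code: `head_width` and the 20-category dataset are hypothetical names and numbers chosen only for illustration.

```python
def head_width(num_classes: int, includes_background: bool) -> int:
    """Per-anchor class scores a head would need under each reading.

    Illustrative arithmetic only; this is not how torchvision is implemented.
    """
    # If the argument excludes the background, one extra slot is needed for it.
    return num_classes if includes_background else num_classes + 1


object_categories = 20  # hypothetical dataset with 20 object categories

# Reading 1 (SSD class docstring): pass 20, background is added internally.
reading_1 = head_width(object_categories, includes_background=False)

# Reading 2 (ssd300_vgg16 docstring): pass 21, background already counted.
reading_2 = head_width(object_categories + 1, includes_background=True)

# The readings agree only if the caller adjusts the argument between them.
print(reading_1, reading_2)

# Passing 21 through unchanged under reading 1 would give one slot too many.
mismatch = head_width(object_categories + 1, includes_background=False)
print(mismatch)
```

Since the function forwards num_classes to the constructor without changing it, at most one of the two docstrings can be describing the actual behaviour.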

This is confusing. Should we or should we not count the background as an object class when instantiating the SSD? In either case, how should object classes be ID’d during training?

As an example, with Faster RCNN, the background is counted as an object class (with ID 0 reserved for it) and actual object classes are identified during training starting from ID 1. What should be the procedure for SSD?
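
For reference, the Faster R-CNN convention described above can be sketched as follows. This is only an illustration of that convention in plain Python: `build_label_map`, `BACKGROUND_ID` and the category names are hypothetical, and whether SSD expects the same layout is exactly what is being asked here.

```python
from typing import Dict, List

BACKGROUND_ID = 0  # reserved for the background in this convention


def build_label_map(category_names: List[str]) -> Dict[str, int]:
    """Map category names to contiguous IDs starting at 1 (0 = background)."""
    first_object_id = BACKGROUND_ID + 1
    return {name: idx for idx, name in enumerate(category_names, start=first_object_id)}


categories = ["cat", "dog", "bird"]  # hypothetical dataset categories
label_map = build_label_map(categories)

# num_classes counts the background slot in addition to the object classes.
num_classes = len(categories) + 1

# Target labels for one image that contains a dog and a bird.
target_labels = [label_map["dog"], label_map["bird"]]
print(label_map, num_classes, target_labels)
```

Under the other reading (background excluded), the value passed would instead be `len(categories)`, which is why the two docstrings need to agree.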

douglasrizzo · June 28, 2021, 2:57pm

This was solved here: pytorch/vision issue #4106, "Documentation confusing on whether SSD and RetinaNet count background as class object".