I read through the OverFeat Lua implementation (@Soumith_Chintala’s fork) as well as Sermanet’s C implementation. I have a few questions, and I hope you will be kind enough to help me understand localization and detection with convnets.
In the OverFeat paper, it is mentioned that a regression network was stacked on top of the earlier layers (i.e., excluding the classifier layer), so that localization and detection are done by this regression head. I have read through the C code and the Lua code, and I see nowhere that this is explicitly implemented (maybe I am missing something). Could you clarify this for me, please?
Secondly, the paper mentions that bounding boxes are predicted and used to localize and detect objects in the image. I did not find this in any of the code I have seen either. The authors mention that the bounding box is provided alongside the image as input to the convnet. How is the ground-truth bounding box generated? Was it manually drawn over the image before being forwarded through the net, or was it specified in code? If it was specified in code, how are the aspect ratios generated? I have been looking around but have found few pointers on how this is actually implemented, and would appreciate a clarification on this point.
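Just to make the question concrete, here is how I currently imagine the ground truth being supplied (a minimal sketch; the file name, field names, and coordinate convention are my assumptions, not anything from the OverFeat code):

```python
# Hypothetical sketch: the ground-truth box is not drawn onto the image
# itself; it is stored as pixel coordinates alongside each image, e.g. in
# an annotation record loaded from a file.
annotation = {
    "image": "dog_001.jpg",        # illustrative path, not from the repo
    "bbox": [48, 240, 195, 371],   # (x_min, y_min, x_max, y_max) in pixels
    "label": "dog",
}

# The aspect ratio then falls out of the coordinates directly:
x_min, y_min, x_max, y_max = annotation["bbox"]
aspect_ratio = (x_max - x_min) / (y_max - y_min)
print(round(aspect_ratio, 3))
```

Is this roughly the right picture, or does OverFeat encode the ground truth differently before regression?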
If I am trying to identify, localize, and detect just one object in a scene, I assume that with the detection and localization schemes proposed in OverFeat, R-CNN, or SSD, I would not have to crop my training images down to just the ROI I am trying to find, since multiple bounding-box predictions are the whole idea of these detectors. Am I correct? And how does one avoid multiple predictions and get just the bounding box for the single object one is trying to find in an image?
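For the single-object case, my naive guess is that one simply keeps the highest-confidence candidate; a minimal sketch of what I mean (the boxes and scores here are made-up placeholders, not detector output):

```python
import torch

# Hypothetical sketch: if the network emits N candidate boxes with one
# confidence score per box, keeping a single box is just an argmax over
# the scores (possibly after non-maximum suppression).
boxes = torch.tensor([[10., 10., 50., 50.],
                      [12., 11., 52., 49.],
                      [200., 30., 240., 80.]])   # (N, 4) as (x1, y1, x2, y2)
scores = torch.tensor([0.30, 0.85, 0.10])        # one confidence per box

best = boxes[scores.argmax()]   # the single highest-scoring box
print(best)
```

Is that all there is to it, or is there a more principled way to train the net so it only ever predicts one box?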
How easy would it be to stack a regression net, as in OverFeat, on top of the resnet18 layers (excluding the last fully connected layer) in PyTorch, for example?
I would appreciate any help.