I am working on object detection, and I have a dataset of images with their corresponding bounding boxes (ground-truth annotations).
I have built my own feature extractor that takes an image as input and outputs a feature map (essentially an encoder-decoder, where the decoder's final output has the same spatial size as the input image and 3 channels). I now want to feed this feature map into a Faster R-CNN model for detection. I have two main doubts at this point:
- Is it okay to skip training the feature extractor and only train the Faster R-CNN for detection?
- If I do need to train the feature extractor as well, what would I use as labels for it?
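For reference, here is a rough sketch of how I am planning to wire the two together. The class `MyFeatureExtractor` and the single `Conv2d` inside it are just placeholders for my actual encoder-decoder, and I am using torchvision's `fasterrcnn_resnet50_fpn` as the detector:

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Placeholder for my encoder-decoder: the real one outputs a 3-channel map
# with the same height/width as the input image.
class MyFeatureExtractor(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)  # stand-in layer

    def forward(self, x):
        return self.net(x)

extractor = MyFeatureExtractor()
detector = fasterrcnn_resnet50_fpn(num_classes=2)  # 1 object class + background

# Dummy image and ground-truth target in the format torchvision detection models expect
image = torch.rand(3, 480, 640)
target = {"boxes": torch.tensor([[50.0, 60.0, 200.0, 220.0]]),  # [x1, y1, x2, y2]
          "labels": torch.tensor([1])}

# Pass the 3-channel feature map to Faster R-CNN as if it were an image
feature_map = extractor(image.unsqueeze(0)).squeeze(0)
loss_dict = detector([feature_map], [target])  # in train mode this returns the detection losses
print(loss_dict)
```

Since the feature map has 3 channels and the same size as the image, I am simply passing it to the detector as if it were an image, but I am not sure whether this is the right way to combine the two models.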
I am a beginner in computer vision and am using PyTorch. It would be really helpful if anyone could guide me on how to approach this problem.