I am using the first few layers of a pretrained AlexNet to extract features from a given image, then applying an RoI pooling layer to get a fixed-dimensional feature vector corresponding to one of the bounding boxes (BBox) in the image, which I pass to a single-layer classifier.
My question: I now also want to use some extra features to improve classification performance. I have about 3000 dim visual features and 4 other features (x, y, width, height) of the BBox. Should I just concatenate the 3000 dim visual features with the 4 extra features?
- Is it possible that the 3000 dim features will dominate and keep the classifier from learning anything significant from the 4 extra features, since 3000 is much greater than 4?
- Should I do some kind of normalization before concatenating the 3000 dim features with the 4 dim features?
- Is it a good idea to add a batch norm layer on the 3000 dim features and another batch norm on the 4 dim features and then concatenate these?
- Should I project the 3000 dims to a smaller dimension or the 4 dims to a higher dimension and then concatenate them?
- Is there a better way than concatenation to use the two types of features for classification?
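To make the options in the list above concrete, here is a minimal sketch of the variants being asked about, assuming PyTorch. The class name `ConcatClassifier`, the projection sizes (256 and 32), `num_classes=10`, and the random input tensors are all placeholders for illustration, not part of the actual pipeline.

```python
import torch
import torch.nn as nn


class ConcatClassifier(nn.Module):
    """Hypothetical fusion head: batch-normalize each feature group,
    optionally project it, concatenate, then apply one linear classifier."""

    def __init__(self, visual_dim=3000, box_dim=4, num_classes=10,
                 visual_proj=None, box_proj=None):
        super().__init__()
        # Separate batch norm per feature group.
        self.visual_bn = nn.BatchNorm1d(visual_dim)
        self.box_bn = nn.BatchNorm1d(box_dim)
        # Optional projections; None keeps the original dimensionality.
        self.visual_proj = (nn.Linear(visual_dim, visual_proj)
                            if visual_proj else nn.Identity())
        self.box_proj = (nn.Linear(box_dim, box_proj)
                         if box_proj else nn.Identity())
        fused_dim = (visual_proj or visual_dim) + (box_proj or box_dim)
        # Single-layer classifier on the concatenated features.
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, visual, box):
        v = self.visual_proj(self.visual_bn(visual))
        b = self.box_proj(self.box_bn(box))
        return self.classifier(torch.cat([v, b], dim=1))


# Assumed sizes: project 3000 -> 256 and 4 -> 32 before concatenating.
model = ConcatClassifier(visual_proj=256, box_proj=32)
visual = torch.randn(8, 3000)  # stand-in for RoI-pooled visual features
box = torch.rand(8, 4)         # stand-in for (x, y, width, height)
logits = model(visual, box)
print(logits.shape)  # torch.Size([8, 10])
```

Leaving `visual_proj` and `box_proj` as `None` gives the plain "batch norm each group, then concatenate" variant with a 3004 dim input to the classifier.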
Thanks!