Hello experts,
I am using various pretrained models like ViT and CvT to do image classification.
My images are in RGB format, but one dimension is much larger than what the pretrained models were trained on. Specifically, my images have dimensions (3x224x1000), while the pretrained models expect (3x224x224).
To resize the images to the required dimensions, I want to apply a trainable/learnable dimensionality reduction technique that operates gradually. However, techniques like convolutional layers and max-pooling tend to reduce the size too aggressively, which results in a significant loss of information (and max-pool isn’t trainable).
For example:
## this moves too fast, losing information
self.conv = nn.Conv2d(in_channels=3, out_channels=3, kernel_size=(3,3), stride=(1,2), padding=1)
The stride in the width dimension cuts the size in half too quickly. I’m looking for a trainable, high-fidelity method to reduce the dimensions more gradually along the width axis without losing too much information in the process.
Any suggestions or guidance would be greatly appreciated!
If the image size is 224x1000, why not just cut it into 4 smaller images and run those through the vision part of the model? You can stack them on the batch dimension through the feed-forward part of the pretrained model, then concatenate the resulting features before a final trainable linear layer. That seems to make the most sense, versus trying to train something from scratch on the input side.
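A minimal sketch of that pipeline, assuming the backbone returns one flat feature vector per 224x224 image (backbone, feat_dim, and the bilinear resize of each 250-px tile down to 224 px are placeholder choices):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TiledClassifier(nn.Module):
    def __init__(self, backbone, feat_dim, num_classes, num_tiles=4):
        super().__init__()
        self.backbone = backbone          # pretrained feature extractor, (N, 3, 224, 224) -> (N, feat_dim)
        self.num_tiles = num_tiles
        # final trainable linear layer over the concatenated tile features
        self.head = nn.Linear(num_tiles * feat_dim, num_classes)

    def forward(self, x):                 # x: (N, 3, 224, 1000)
        n = x.shape[0]
        # cut the width into num_tiles pieces (250 px each) and resize each to 224 px
        tiles = torch.chunk(x, self.num_tiles, dim=3)
        tiles = [F.interpolate(t, size=(224, 224), mode="bilinear", align_corners=False)
                 for t in tiles]
        batch = torch.cat(tiles, dim=0)   # stack the tiles on the batch dimension
        feats = self.backbone(batch)      # (num_tiles * N, feat_dim)
        # regroup per original image and concatenate the tile features
        feats = feats.reshape(self.num_tiles, n, -1).permute(1, 0, 2).reshape(n, -1)
        return self.head(feats)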
I am worried about losing context and information by cutting through objects when I make the split. I suppose I could use a sliding window with overlap (roughly as sketched below), and/or attention heads. It’s an interesting option, and I’m looking into it. Thanks for the suggestion!
I’m looking for as many options as possible since this is a hard classification task (subtle differences decide the class). And there is something desirable about simply adding a trainable dimension reduction layer in front of the model. The model will have to be trained for a long time to fine-tune anyway, so I’m not worried about training things from scratch.
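The sliding window with overlap could be as simple as unfold along the width; the 224-px window and 194-px step below are just one choice that happens to tile the 1000-px width exactly with 30 px of overlap:

import torch

x = torch.randn(1, 3, 224, 1000)                      # (N, C, H, W)
# 224-px windows stepping 194 px along the width -> 5 windows at offsets
# 0, 194, 388, 582, 776, each sharing 30 px with its neighbour
windows = x.unfold(dimension=3, size=224, step=194)   # (1, 3, 224, 5, 224)
windows = windows.permute(0, 3, 1, 2, 4).reshape(-1, 3, 224, 224)
print(windows.shape)                                  # torch.Size([5, 3, 224, 224])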
One approach would be to use adaptive_avg_pool2d() to reduce your width dimension as gradually as you like. You could add trainability to this by interposing some trainable convolution layers.
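A minimal sketch of that combination, assuming an intermediate width of 512 (an arbitrary illustrative choice) as the first pooling target:

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 3, 224, 1000)                     # (N, C, H, W)

conv_a = nn.Conv2d(3, 3, kernel_size=3, padding=1)   # trainable, shape-preserving
conv_b = nn.Conv2d(3, 3, kernel_size=3, padding=1)

x = F.relu(conv_a(x))
x = F.adaptive_avg_pool2d(x, (224, 512))             # shrink only the width, 1000 -> 512
x = F.relu(conv_b(x))
x = F.adaptive_avg_pool2d(x, (224, 224))             # final step down to the pretrained size
print(x.shape)                                       # torch.Size([1, 3, 224, 224])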
You could also recognize that as you reduce your width by a factor of two with a stride = 2 convolution, you can “move” spatial information into the channels dimension by increasing the value of out_channels of your stride-2 convolution. You can then use subsequent convolutions to move the information that you put in the channels dimension back into your reduced width dimension.
Something like this (but you would most likely want some non-linear activations to get good trainability):
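import torch.nn as nn
import torch.nn.functional as F

# rough sketch of the idea above -- channel counts and kernel sizes are
# illustrative, and the non-linear activations mentioned above are omitted
class WidthToChannels(nn.Module):
    def __init__(self):
        super().__init__()
        # halve the width while doubling the channels, so spatial information
        # is moved into the channels dimension rather than discarded
        self.down1 = nn.Conv2d(3, 6, kernel_size=3, stride=(1, 2), padding=1)
        # fold the extra channel information back into the (reduced) width
        self.mix1 = nn.Conv2d(6, 3, kernel_size=3, stride=1, padding=1)
        # second stage: width 500 -> 250
        self.down2 = nn.Conv2d(3, 6, kernel_size=3, stride=(1, 2), padding=1)
        self.mix2 = nn.Conv2d(6, 3, kernel_size=3, stride=1, padding=1)

    def forward(self, x):                          # (N, 3, 224, 1000)
        x = self.mix1(self.down1(x))               # (N, 3, 224, 500)
        x = self.mix2(self.down2(x))               # (N, 3, 224, 250)
        # last gentle step down to the pretrained model's 224 width
        return F.adaptive_avg_pool2d(x, (224, 224))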