Extracting image features to feed to Transformer layer

I’m trying to understand how to use Transformers for OCR starting with something very basic like MNIST. My thought was to extract image patches from each MNIST image and then feed those into a Transformer with a classifier on the output. For example:

import torch

model = torch.nn.Sequential(
          torch.nn.Conv2d(1, 16, 3, 3),   # kernel_size=3, stride=3: 28x28 -> 9x9
          torch.nn.ReLU(),
          torch.nn.Conv2d(16, 32, 3, 3),  # kernel_size=3, stride=3: 9x9 -> 3x3
          torch.nn.ReLU()
        )

model(images).shape
> torch.Size([64, 32, 3, 3])

So here is a batch of 64 feature maps, each with 32 channels of size 3x3. Given that the input to a Transformer is supposed to be a sequence of vectors, my thought is that I should “unroll” the 3x3 patches to end up with a sequence of 32 9-dimensional vectors. Does this make sense?
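To make it concrete, here is a rough sketch of the reshaping I have in mind (the `nhead`, layer count, and mean-pooling are just placeholder choices on my part):

feats = model(images)                              # [64, 32, 3, 3], using the model and images from above
seq = feats.flatten(2)                             # [64, 32, 9]: 32 tokens of dimension 9
seq = seq.permute(1, 0, 2)                         # [32, 64, 9]: (seq_len, batch, d_model) for nn.TransformerEncoder
encoder_layer = torch.nn.TransformerEncoderLayer(d_model=9, nhead=3)
encoder = torch.nn.TransformerEncoder(encoder_layer, num_layers=2)
out = encoder(seq)                                 # [32, 64, 9]
logits = torch.nn.Linear(9, 10)(out.mean(dim=0))   # pool over the sequence, then classify into 10 digits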

Hi,

for me at least, unrolling those 3x3 patches as you’ve proposed doesn’t really make sense: you would just be replacing the FC classifier with a Transformer, which seems odd to me, since those channels don’t really form a meaningful sequence (please correct me if they should after all).
For your task, I would recommend checking out this paper. I am, however, not sure how this approach could be adapted to more challenging OCR tasks.
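Just to illustrate what I mean by a more natural sequence, here is a rough sketch of treating image patches themselves as the tokens (this is only my own illustration with an arbitrary patch size and embedding dimension, not necessarily what the paper does):

import torch

images = torch.randn(64, 1, 28, 28)                # dummy MNIST batch
patches = images.unfold(2, 7, 7).unfold(3, 7, 7)   # [64, 1, 4, 4, 7, 7]: a 4x4 grid of 7x7 patches
tokens = patches.reshape(64, 16, 49)               # 16 tokens, each a flattened 7x7 patch
tokens = torch.nn.Linear(49, 64)(tokens)           # [64, 16, 64]: project each patch to an embedding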

Regards,
Unity05

Thanks @Unity05. I tried with a Transformer and without, and performance did not improve much, so perhaps you’re right. Maybe it will make more sense with a string of characters.
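For instance (just a rough sketch of the text-line setup I’m imagining, with made-up shapes), the feature columns of a wide line image would give a natural left-to-right sequence:

import torch

line_images = torch.randn(64, 1, 32, 128)            # dummy batch of text-line crops
cnn = torch.nn.Sequential(
    torch.nn.Conv2d(1, 32, 3, stride=2, padding=1),  # -> [64, 32, 16, 64]
    torch.nn.ReLU(),
    torch.nn.Conv2d(32, 64, 3, stride=2, padding=1), # -> [64, 64, 8, 32]
    torch.nn.ReLU(),
)
feats = cnn(line_images)
tokens = feats.mean(dim=2).permute(0, 2, 1)          # [64, 32, 64]: 32 column tokens, left to right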


Indeed. Excited to see if you manage to adapt it to a general OCR use case. :slight_smile:
