(Vision Transformer) On using subsets of a "large" PositionEmbedding matrix for smaller images


I am modifying the VisionTransformer implementation in timm to preserve the aspect ratio of input images as closely as possible.

The original ViT (as implemented in timm and in the original JAX code) expects square images. But you can still generate patch embeddings for an image of arbitrary size.
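To make the "arbitrary size" point concrete, here is a minimal sketch of how many patches a non-square image yields. It assumes non-overlapping square patches and that any remainder at the right or bottom edge is dropped; the helper name `patch_grid` and the sizes are made up for illustration and are not part of timm.

```python
def patch_grid(height, width, patch=16):
    """Return (rows, cols) of non-overlapping patch x patch tiles.

    Assumption: pixels that do not fill a whole patch are discarded.
    """
    return height // patch, width // patch


# Hypothetical example sizes: a 1:2 (h:w) image and a square one.
for h, w in [(224, 448), (224, 224)]:
    rows, cols = patch_grid(h, w)
    print((h, w), "->", (rows, cols), "grid,", rows * cols, "patches")
```

So the number of patches, and therefore the number of position embeddings needed, depends on both the image size and its aspect ratio.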

What remains is to add position embeddings to each of these patch embeddings before passing them to the Transformer encoder.

There is a maximum aspect ratio that I work with (say h:w = 1:2). At the moment, I initialize the position embeddings for the largest possible image, and for an input image that generates n patches I use only the first n embeddings.
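For reference, a minimal sketch of the selection I mean. A plain nested list stands in for the learned parameter here, the class token is ignored, and the sizes (largest grid 14 x 28, embedding dim 4) and the function name are made up for illustration.

```python
def first_n_position_embeddings(pos_embed, grid_h, grid_w):
    """Return the first grid_h * grid_w rows of a (max_patches, dim) table."""
    n = grid_h * grid_w
    if n > len(pos_embed):
        raise ValueError("image produces more patches than the table holds")
    return pos_embed[:n]


# Hypothetical sizes: largest grid is 14 x 28 patches, embedding dim is 4.
MAX_GRID_H, MAX_GRID_W, DIM = 14, 28, 4
# Stand-in for a learned parameter: row k is filled with the value k.
pos_embed = [[float(k)] * DIM for k in range(MAX_GRID_H * MAX_GRID_W)]

# A smaller image that yields a 14 x 14 grid uses rows 0..195 of the table.
subset = first_n_position_embeddings(pos_embed, grid_h=14, grid_w=14)
print(len(subset), len(subset[0]))  # 196 4
```

Note that the selection depends only on n, not on the shape of the grid, so two images with the same patch count but different aspect ratios get exactly the same embeddings.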

I feel, though, that this approach is flawed.

From the ViT paper, it seems that the patches are flattened in row-major order (left to right, top to bottom). IMO, the resulting 1D index doesn't explicitly encode the 2D position of a patch; it just imposes an arbitrary but consistent ordering. This logic lends some sense to my approach.

At the same time, if the k-th embedding represents the top-right corner patch of the largest image, then for a smaller image with fewer patches per row the same k-th embedding gets re-purposed for a patch somewhere else, possibly a center patch.
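A small sketch of this concern, assuming row-major flattening and made-up grid sizes (28 patches per row for the largest image, 14 per row for a smaller one). The helper `grid_position` is hypothetical.

```python
def grid_position(k, grid_w):
    """Map a flat, row-major patch index k to (row, col) in a grid grid_w wide."""
    return divmod(k, grid_w)


# Same flat index, two different grid widths.
for k in (27, 100):
    large = grid_position(k, grid_w=28)  # largest image, 28 patches per row
    small = grid_position(k, grid_w=14)  # smaller image, 14 patches per row
    print(k, large, small)
# prints:
# 27 (0, 27) (1, 13)
# 100 (3, 16) (7, 2)
```

Index 27 is the top-right corner of the large grid but sits on the second row of the small one, and index 100 moves from the right half of row 3 to the left edge of row 7. So the same learned vector ends up attached to quite different 2D locations depending on the image width.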

I was wondering what people make of this.