How to understand the 768 in ViT?

  • Why is the vector length of the ViT patch embedding 768, rather than some other value?
  • If we think of 768 as 16 x 16 with 3 channels, which ordering is correct: 16 x 16 x 3 or 3 x 16 x 16?

There are two parts to this. Yes, 768 happens to be the number of input values per patch in the 16x16-patch models (16 x 16 x 3 = 768), but the embedding dimension does not have to match that number. In fact, the ViT-Base variants with 8x8 and 32x32 patches also use 768 as the embedding dimension, and ViT-Large uses 1024…
In practice, a linear layer (applied after flattening the patch) or an equivalent conv layer (with kernel size = stride = patch size) maps however many input values a patch has to the embedding dimension.
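
To make this concrete, here is a minimal NumPy sketch of the flatten-then-linear version. It is only an illustration: the helper names `patchify` and `embed` are made up for this post, the 224x224 input size is an assumption, and the weights are random stand-ins for what a real model would learn.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened patches of shape (N, patch*patch*C)."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, C) -> (gh, gw, patch, patch, C) -> (N, patch*patch*C)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

def embed(image, patch, embed_dim, rng):
    """Project every flattened patch to embed_dim with one linear layer."""
    patches = patchify(image, patch)
    in_dim = patches.shape[1]
    # Random stand-in for the learned projection weights (hypothetical values).
    weight = rng.normal(scale=0.02, size=(in_dim, embed_dim))
    bias = np.zeros(embed_dim)
    return patches, patches @ weight + bias

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))  # assumed input resolution

shapes = {}
for patch in (8, 16, 32):
    patches, tokens = embed(image, patch, embed_dim=768, rng=rng)
    shapes[patch] = (patches.shape, tokens.shape)
    print(f"patch {patch}: flattened {patches.shape} -> tokens {tokens.shape}")
```

The number of values per patch changes with the patch size (192, 768, 3072), but the output is always 768 per token because that is set by the linear layer, not by the patch. As for 16 x 16 x 3 versus 3 x 16 x 16: that only changes the order in which the 768 input values are flattened. Since the projection weights are learned, either order should work as long as it is applied consistently to every patch.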

Best regards

