I’m using ViT from vit_pytorch; the model is shown below:
ViT(
(to_patch_embedding): Sequential(
(0): Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=16, p2=16)
(1): Linear(in_features=768, out_features=1024, bias=True)
)
(dropout): Dropout(p=0.1, inplace=False)
(transformer): Transformer(
(layers): ModuleList(
I input an image of torch.Size([1, 3, 128, 128]). The patch size is 16 (p1=16, p2=16), so the image is split into an 8x8 grid of 64 patches, and each patch flattens to 16x16x3 = 768 values.
I get these outputs:
torch.Size([1, 64, 1024]) after to_patch_embedding
torch.Size([1, 65, 1024]) after transformer
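To make the shapes concrete, here is a minimal sketch (assuming only torch, with random data in place of a real image) that reproduces the Rearrange and Linear steps of to_patch_embedding by hand:

```python
import torch

# 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)' done with reshape/permute.
img = torch.randn(1, 3, 128, 128)
b, c, H, W = img.shape
p = 16                                        # patch size (p1 = p2 = 16)
h, w = H // p, W // p                         # 8 x 8 grid of patches
x = img.reshape(b, c, h, p, w, p)             # split H -> (h, p1), W -> (w, p2)
x = x.permute(0, 2, 4, 3, 5, 1)               # b h w p1 p2 c
patches = x.reshape(b, h * w, p * p * c)      # b (h w) (p1 p2 c)
print(patches.shape)                          # torch.Size([1, 64, 768])

# Linear(768, 1024) maps each flattened 768-dim patch to a 1024-dim embedding.
emb = torch.nn.Linear(768, 1024)(patches)
print(emb.shape)                              # torch.Size([1, 64, 1024])
```

The jump from 64 to 65 tokens after the transformer comes from the CLS token that ViT prepends before the transformer layers.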
What are the additional 256 features (1024 - 768) generated by the Linear module in “to_patch_embedding”?
I’d like to reconstruct a 2D image from the final torch.Size([1, 65, 1024]) output, treating it as a kind of 2D feature image. Is that possible?
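One way this could work, sketched under the shapes above (64 patch tokens on an 8x8 grid, plus one CLS token at index 0): drop the CLS token and fold the remaining tokens back onto their grid positions, giving a 1024-channel "feature image" that can be visualized or upsampled.

```python
import torch

tokens = torch.randn(1, 65, 1024)       # stand-in for the transformer output
patch_tokens = tokens[:, 1:, :]         # drop the CLS token -> [1, 64, 1024]
h = w = 8                               # 128 / 16 = 8 patches per side
feat = patch_tokens.transpose(1, 2).reshape(1, 1024, h, w)
print(feat.shape)                       # torch.Size([1, 1024, 8, 8])

# Example: collapse channels to get a coarse 8x8 map per image.
mean_map = feat.mean(dim=1)             # [1, 8, 8]
```

This recovers the spatial layout of the patches, not the original pixels; reconstructing actual image content from these features would need a trained decoder.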