I’m using ViT via vit_pytorch; the model is below:
```
ViT(
  (to_patch_embedding): Sequential(
    (0): Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=16, p2=16)
    (1): Linear(in_features=768, out_features=1024, bias=True)
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (transformer): Transformer(
    (layers): ModuleList(
```
I input an image of shape torch.Size([1, 3, 128, 128]) with a patch size of 16 (each 16×16 RGB patch flattens to 16·16·3 = 768 values, and 128/16 = 8 patches per side gives an 8×8 grid of 64 patches), and I get these outputs:

torch.Size([1, 64, 1024]) after to_patch_embedding
torch.Size([1, 65, 1024]) after the transformer
Where do the additional 256 features (768 → 1024) generated by the Linear module in “to_patch_embedding” come from?
I’d also like to restore/reconstruct a 2D image from the final torch.Size([1, 65, 1024]) output; let’s say I treat it as a kind of 2D feature image. Is that possible?
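Recovering the original pixels would need a trained decoder, but the patch tokens can at least be folded back into a spatial feature map. A minimal numpy sketch, assuming the extra token is a CLS token prepended at index 0 (as vit_pytorch does) and the remaining 64 tokens map row-major onto the 8×8 patch grid:

```python
import numpy as np

# Stand-in for the transformer output from the post: (1, 65, 1024).
out = np.random.randn(1, 65, 1024)

feat = out[:, 1:, :]                   # drop the CLS token -> (1, 64, 1024)
h = w = 8                              # 128 / 16 = 8 patches per side
fmap = feat.reshape(1, h, w, 1024)     # (1, 8, 8, 1024) spatial "feature image"
fmap = fmap.transpose(0, 3, 1, 2)      # (1, 1024, 8, 8), channels-first

# One crude way to get a single 2D image: average over the 1024 channels.
img2d = fmap.mean(axis=1)              # (1, 8, 8)
print(fmap.shape, img2d.shape)
```

This gives an 8×8 map (one cell per patch), not a 128×128 image; getting back to pixel resolution would require upsampling or a learned reconstruction head (e.g. MAE-style decoding).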