What are the additional 256 features in ViT, and can the image be restored?

I’m using ViT via vit_pytorch; the model is shown below:

ViT(
  (to_patch_embedding): Sequential(
    (0): Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=16, p2=16)
    (1): Linear(in_features=768, out_features=1024, bias=True)
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (transformer): Transformer(
    (layers): ModuleList(...)

I input an image of torch.Size([1, 3, 128, 128]) with the patch size set to 16 (so each patch has 16x16x3 = 768 features for an RGB image, giving 8x8 = 64 patches), and I get the following output:

torch.Size([1, 64, 1024]) after to_patch_embedding
torch.Size([1, 65, 1024]) after transformer
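These shapes can be reproduced with a minimal sketch of the patch embedding. It mirrors the printed to_patch_embedding (Rearrange followed by Linear, with einops providing the Rearrange layer; any normalization layers a newer vit_pytorch version may insert are omitted, and the input is a random stand-in):

import torch
import torch.nn as nn
from einops.layers.torch import Rearrange

# mirrors the printed to_patch_embedding: 16x16 patches, 768 -> 1024
to_patch_embedding = nn.Sequential(
    Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=16, p2=16),
    nn.Linear(in_features=768, out_features=1024),
)

x = torch.randn(1, 3, 128, 128)  # random stand-in for the input image
tokens = to_patch_embedding(x)
print(tokens.shape)
# torch.Size([1, 64, 1024]) -> (128/16)**2 = 64 patches, each projected from 768 to 1024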

What are the additional 256 features generated by the Linear module in “to_patch_embedding”?

I’d like to restore/reconstruct a 2D image from this last torch.Size([1, 65, 1024]) output; let’s say it is a kind of 2D feature image. Is that possible?

Which 256 features are you referring to? Could you describe this question in more detail, please?

Thanks for the kind response.

I input an image of torch.Size([1, 3, 128, 128]) with the patch size set to 16 (where the number of features per patch is 16x16x3 = 768 for an RGB image), and I get the following outputs:

torch.Size([1, 64, 1024]) after "to_patch_embedding"
torch.Size([1, 65, 1024]) after "transformer"

When I look at the output of “to_patch_embedding”, 256 additional features have been added to the initial 768 features. Does this mean that the “Linear” module just increases the number of features, for instance like a multi-layer perceptron?

That was my first question; my second question is below.

At the end of the “transformer”, I get a feature of torch.Size([1, 65, 1024]). The first slice, [:, 0:1, :], is the class token if I understand correctly, and the rest of the array holds the features of the input image.

What I’d like to do here is restore/reconstruct a 2D image using this last output of torch.Size([1, 65, 1024]). Is that possible?

If my question is not appropriate for this forum, I’ll close it.

Yes, the linear layer maps the 768 features from the input activation (i.e. the output of the Rearrange module) to 1024 features:

 Linear(in_features=768, out_features=1024, bias=True)

I don’t know how the image is related to these output features, but assuming the features were created from an image, I could imagine that restoring it could work. However, it also depends heavily on the model architecture, the input image resolution, etc. You could check some approaches for restoring images, e.g. autoencoders.
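As a rough sketch of that idea (hypothetical, untrained, and only one of several possible designs): drop the class token, map each of the 64 patch tokens back to a flattened 16x16x3 patch with a trainable linear layer, and rearrange the patches onto the 8x8 grid. Such a decoder would have to be trained with a reconstruction loss before it produces meaningful images:

import torch
import torch.nn as nn
from einops.layers.torch import Rearrange

features = torch.randn(1, 65, 1024)  # stand-in for the transformer output
patch_tokens = features[:, 1:, :]    # [1, 64, 1024] after removing the class token

# hypothetical decoder: token -> flattened 16x16x3 patch -> pixel grid
decoder = nn.Sequential(
    nn.Linear(1024, 16 * 16 * 3),
    Rearrange('b (h w) (p1 p2 c) -> b c (h p1) (w p2)', h=8, w=8, p1=16, p2=16),
)

reconstruction = decoder(patch_tokens)
print(reconstruction.shape)
# torch.Size([1, 3, 128, 128])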

Thank you for your kind comments.

Yes, the linear layer maps the 768 features from the input activation

I understand that the additional 256 features added via the “Linear” module are unknown.

I don’t know how the image is related to these output features, …

Yes, one approach could be to put a decoder after the transformer and feed the output features to the decoder. Before doing that, I wanted to ask the experts whether another approach is possible for restoring the input image.

Best regards,

No, these features are neither “unknown” nor are they added. The linear layer applies a linear transformation using its trainable parameters.
Here is a small example showing it with just the weight matrix (the bias was skipped to simplify the example):

import torch
import torch.nn as nn

input_features = torch.arange(5).view(1, -1).float()
print(input_features)
# tensor([[0., 1., 2., 3., 4.]])

lin = nn.Linear(in_features=5, out_features=7, bias=False)
print(lin.weight)
# Parameter containing:
# tensor([[-0.1611, -0.4197, -0.0630,  0.0092, -0.1258],
#         [-0.1206,  0.3807,  0.0975, -0.0231,  0.1823],
#         [-0.1722, -0.4393, -0.2412, -0.4215, -0.2296],
#         [-0.4187,  0.2990,  0.1310, -0.3879,  0.0489],
#         [ 0.3792, -0.3314, -0.3503, -0.3338, -0.2754],
#         [-0.0981,  0.3740, -0.1645,  0.2752, -0.1942],
#         [ 0.1042,  0.1736,  0.0024,  0.2128, -0.0113]], requires_grad=True)

out = lin(input_features)
print(out)
# tensor([[-1.0211,  1.2358, -3.1047, -0.4071, -3.1350,  0.0937,  0.7719]],
#        grad_fn=<MmBackward0>)

As you can see, the linear layer performs a matrix multiplication between the input features and its weight matrix to create the output features. No features are just "added" to the inputs, since the input is transformed.
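You can verify this equivalence directly with the tensors from the example above:

print(torch.allclose(out, input_features @ lin.weight.T))
# True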

Thank you for your kind comment again with an example!