Thank you for your kind comments.
Yes, the linear layer maps the 768 features from the input activation.
I understand that the additional 256 features added via the “Linear module” are unknown.
I don’t know how the image is related to these output features, …
Yes, one approach would be to put a decoder after the transformer and feed the output features to that decoder. Before doing that, I wanted to ask the experts whether there is another way to restore the input image.
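For reference, this is roughly what I have in mind for the decoder approach. It is only a minimal sketch under assumptions that are not from my actual model: `PatchDecoder` is a hypothetical name, the output feature size is taken as 1024 (768 + 256), and the image is assumed to be 224x224 split into 16x16 patches with no class token in the output.

```python
import torch
import torch.nn as nn


class PatchDecoder(nn.Module):
    """Hypothetical decoder: maps per-patch features back to pixel patches."""

    def __init__(self, feat_dim=1024, patch=16, grid=14, channels=3):
        super().__init__()
        self.patch, self.grid, self.channels = patch, grid, channels
        # One linear projection per token: features -> flattened pixel patch
        self.to_pixels = nn.Linear(feat_dim, channels * patch * patch)

    def forward(self, tokens):
        # tokens: (batch, grid*grid, feat_dim), class token assumed removed
        b = tokens.shape[0]
        p, g, c = self.patch, self.grid, self.channels
        x = self.to_pixels(tokens)            # (b, g*g, c*p*p)
        x = x.view(b, g, g, c, p, p)          # split grid and patch axes
        x = x.permute(0, 3, 1, 4, 2, 5)       # (b, c, g, p, g, p)
        return x.reshape(b, c, g * p, g * p)  # (b, c, H, W)


decoder = PatchDecoder()
features = torch.randn(2, 196, 1024)  # stand-in for the transformer output
recon = decoder(features)
print(recon.shape)  # torch.Size([2, 3, 224, 224])
```

Such a decoder would presumably be trained with the transformer frozen, for example with `nn.MSELoss()` between `recon` and the original input image, so it only gives an approximate reconstruction.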
Best regards,