Replacing Dense Layer with Transformer in Image Model?

I was hoping to replace one dense layer with a transformer block in an image classification model, hopefully for better performance. Specifically, I want to replace the classifier section that comes after the feature extraction with a transformer block. Does anyone have any tips/code to show how to do this? My current issue is that most transformer models use a target mask, but I'm guessing that won't help when replacing a dense layer, and may even hurt. I also think that since we are not working with text, an embedding layer is not needed. I'm guessing positional encoding is useful, given that the order of features matters, but I'm not sure whether a standard Positional Encoding → Transformer Block → Linear layer setup would work at all.
Any tips?

I would look at the vision transformer architecture (e.g. DeiT-Tiny, to name one).

They have an embedding layer, too: it splits the image into 16x16 patches, applies a single convolution mapping each patch to many channels, and then adds a learned positional embedding.
You could use the output of your convolutional feature extractor as the patch tokens (so skipping that patch-embedding convolution) and keep just the learned positional embeddings.
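A minimal sketch of that idea, assuming your feature extractor outputs a `(B, C, H, W)` feature map (the module and parameter names here are my own, not from any particular library):

```python
import torch
import torch.nn as nn

class FeatureTokenizer(nn.Module):
    """Turn a CNN feature map into a sequence of transformer tokens
    with a learned positional embedding (no text-style embedding lookup)."""

    def __init__(self, channels=512, embed_dim=192, num_tokens=49):
        super().__init__()
        # Project the channel dimension to the transformer's model dimension.
        self.proj = nn.Linear(channels, embed_dim)
        # One learned positional embedding per spatial location (H*W = num_tokens).
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens, embed_dim))

    def forward(self, x):                      # x: (B, C, H, W)
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C) — one token per location
        tokens = self.proj(tokens)             # (B, H*W, embed_dim)
        return tokens + self.pos_embed         # add learned positions
```

For a 7x7 feature map with 512 channels you would use `num_tokens=49`, `channels=512`; the concrete sizes are placeholders you would match to your backbone.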
The vision transformers also prepend a learned global token (the class token) to the patch tokens. This is where they attach the prediction head at the end of the transformer. The class token originates from BERT and, combined with attention, essentially replaces pooling as the way to go from patch-level to global information.
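A hedged sketch of that head, using PyTorch's built-in encoder layers (names and sizes are mine; note that an encoder needs no target mask, which addresses the masking concern from the question):

```python
import torch
import torch.nn as nn

class TransformerHead(nn.Module):
    """Classifier head: prepend a learned class token, run a small
    transformer encoder, and predict from the class token's output."""

    def __init__(self, embed_dim=192, num_heads=3, depth=2, num_classes=10):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens):                 # tokens: (B, N, embed_dim)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        x = torch.cat([cls, tokens], dim=1)    # prepend the class token
        x = self.encoder(x)                    # encoder only — no mask needed
        return self.head(x[:, 0])             # logits from the class token
```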

The file has a DeiTTiny class a bit further down.

Best regards