Accuracy Drop in ViT with Patch Embedding: Investigating the Impact of Added Convolutional Layers

Hello, I’m currently working on incorporating a patch embedding layer into my Vision Transformer (ViT). I’ve defined this layer using three 2D convolutional layers and initialized them with a normal distribution. The remaining layers of the model use pre-trained weights. However, when running inference without any training, I noticed the accuracy dropped to 0%. Previously, the model achieved 45% accuracy with only one 2D convolutional layer (the standard ViT uses a single convolution as its patch embedding).
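
Roughly, the setup looks like the sketch below. It is a minimal example assuming 224×224 inputs, a 16×16 patch size, and embed_dim=768; the layer widths and the class names (`SingleConvPatchEmbed`, `MultiConvPatchEmbed`) are illustrative rather than my exact code:

```python
import torch
import torch.nn as nn

class SingleConvPatchEmbed(nn.Module):
    """Standard ViT patch embedding: one Conv2d with kernel = stride = patch size."""
    def __init__(self, in_chans=3, embed_dim=768, patch_size=16):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # (B, C, H, W) -> (B, num_patches, embed_dim)
        return self.proj(x).flatten(2).transpose(1, 2)

class MultiConvPatchEmbed(nn.Module):
    """Replacement with three conv layers whose weights are drawn from a normal distribution."""
    def __init__(self, in_chans=3, embed_dim=768, patch_size=16):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_chans, 64, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(64, 256, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(256, embed_dim, kernel_size=patch_size // 4, stride=patch_size // 4),
        )
        # Normal-distribution initialization of the new layers.
        for m in self.proj.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.normal_(m.weight, mean=0.0, std=0.02)
                nn.init.zeros_(m.bias)

    def forward(self, x):
        return self.proj(x).flatten(2).transpose(1, 2)

x = torch.randn(1, 3, 224, 224)
print(SingleConvPatchEmbed()(x).shape)  # torch.Size([1, 196, 768])
print(MultiConvPatchEmbed()(x).shape)   # torch.Size([1, 196, 768])
```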

Are you finetuning the additional layers, or are you evaluating the accuracy with the randomly initialized weights? I’m not sure acceptable accuracy would be expected in the latter case.

I am evaluating the accuracy using the randomly initialized weights. Unexpectedly, the addition of a single layer led to a significant decrease in accuracy, contrary to what was suggested in the “Vision Transformer for Contrastive Clustering” paper, which claimed it would enhance accuracy.

After briefly looking at the paper, I couldn’t see where it indicates that randomly initialized weights were used. Could you point to where exactly the paper says this?

In general, I am having trouble understanding how accuracy could be improved, or even maintained, if a randomly initialized layer is added to the model without finetuning or weight updates: a random convolution will scramble the activations that the downstream pre-trained blocks were trained to expect.
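
If the goal is to keep the pre-trained backbone, the usual approach would be to finetune at least the new patch embedding before measuring accuracy. Below is a rough sketch, assuming a timm-style ViT that exposes the patch embedding as `model.patch_embed`; the `MultiConvPatchEmbed` swap is hypothetical and would need to produce matching output shapes:

```python
import torch
import timm

# Load a ViT with pre-trained weights.
model = timm.create_model("vit_base_patch16_224", pretrained=True)

# Hypothetical: swap in the custom multi-conv patch embedding described above.
# model.patch_embed = MultiConvPatchEmbed()

# Freeze the pre-trained weights and train only the new, randomly initialized layer.
for p in model.parameters():
    p.requires_grad = False
for p in model.patch_embed.parameters():
    p.requires_grad = True

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
# ... run a few epochs of finetuning on your dataset, then evaluate accuracy again.
```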