Accuracy Drop in ViT with Patch Embedding: Investigating the Impact of Added Convolutional Layers

Hello, I’m currently working on incorporating a patch embedding layer into my Vision Transformer (ViT). I’ve defined this layer using three 2D convolutional layers and initialized them with a normal distribution. The remaining layers of the model use pre-trained weights. However, when running inference without any training, I noticed the accuracy dropped to 0%. Previously, the model achieved 45% accuracy with only one 2D convolutional layer (the standard ViT uses a single convolution as its patch embedding).
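
Roughly, the setup looks like the sketch below. It is a minimal example assuming 224×224 inputs, a 16×16 patch size, and embed_dim=768; the layer widths and the class names (`SingleConvPatchEmbed`, `MultiConvPatchEmbed`) are illustrative rather than my exact code:

```python
import torch
import torch.nn as nn

class SingleConvPatchEmbed(nn.Module):
    """Standard ViT patch embedding: one Conv2d with kernel = stride = patch size."""
    def __init__(self, in_chans=3, embed_dim=768, patch_size=16):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # (B, C, H, W) -> (B, num_patches, embed_dim)
        return self.proj(x).flatten(2).transpose(1, 2)

class MultiConvPatchEmbed(nn.Module):
    """Replacement with three conv layers whose weights are drawn from a normal distribution."""
    def __init__(self, in_chans=3, embed_dim=768, patch_size=16):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_chans, 64, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(64, 256, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(256, embed_dim, kernel_size=patch_size // 4, stride=patch_size // 4),
        )
        # Normal-distribution initialization of the new layers.
        for m in self.proj.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.normal_(m.weight, mean=0.0, std=0.02)
                nn.init.zeros_(m.bias)

    def forward(self, x):
        return self.proj(x).flatten(2).transpose(1, 2)

x = torch.randn(1, 3, 224, 224)
print(SingleConvPatchEmbed()(x).shape)  # torch.Size([1, 196, 768])
print(MultiConvPatchEmbed()(x).shape)   # torch.Size([1, 196, 768])
```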

Are you finetuning the additional layers, or are you evaluating the accuracy with the randomly initialized weights? I’m not sure acceptable accuracy would be expected in the latter case.

I am evaluating the accuracy using the randomly initialized weights. Unexpectedly, the addition of a single layer led to a significant decrease in accuracy, contrary to what was suggested in the “Vision Transformer for Contrastive Clustering” paper, which claimed it would enhance accuracy.

After briefly looking at the paper, I couldn’t see where it indicates that randomly initialized weights were used. Could you point to where exactly the paper says this?

In general, I am having trouble understanding how accuracy could be improved, or even maintained, if a randomly initialized layer is added to the model without finetuning or weight updates: a random convolution will scramble the activations that the downstream pre-trained blocks were trained to expect.
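
If the goal is to keep the pre-trained backbone, the usual approach would be to finetune at least the new patch embedding before measuring accuracy. Below is a rough sketch, assuming a timm-style ViT that exposes the patch embedding as `model.patch_embed`; the `MultiConvPatchEmbed` swap is hypothetical and would need to produce matching output shapes:

```python
import torch
import timm

# Load a ViT with pre-trained weights.
model = timm.create_model("vit_base_patch16_224", pretrained=True)

# Hypothetical: swap in the custom multi-conv patch embedding described above.
# model.patch_embed = MultiConvPatchEmbed()

# Freeze the pre-trained weights and train only the new, randomly initialized layer.
for p in model.parameters():
    p.requires_grad = False
for p in model.patch_embed.parameters():
    p.requires_grad = True

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
# ... run a few epochs of finetuning on your dataset, then evaluate accuracy again.
```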