Pre-trained ViT as backbone for segmentation network

I am trying to replicate the Segmenter architecture [1] and apply it to the TuSimple lane detection dataset. The Segmenter model I adapted to my needs consists of a ViT-B/16 backbone and a 3-layer MLP classification head, which I use to detect the presence of the lane class in the output patch embeddings from the ViT. I then planned to interpolate the patch-level predictions to pixel level, as the authors did in the original paper.
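For illustration, here is a minimal PyTorch sketch of such a head. It is not the exact code in question: the class name `PatchMLPHead`, the hidden width of 256, the GELU activations, and the bilinear upsampling mode are all assumptions made for the example. It takes the patch tokens (without the class token), predicts one lane logit per patch, and upsamples the resulting grid to the input resolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchMLPHead(nn.Module):
    """Hypothetical 3-layer MLP that scores each ViT patch embedding."""

    def __init__(self, embed_dim=768, hidden_dim=256, patch_size=16):
        super().__init__()
        self.patch_size = patch_size
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 1),  # one logit per patch: lane vs. background
        )

    def forward(self, patch_tokens, image_size):
        # patch_tokens: (B, N, D), class token already removed
        batch, num_patches, _ = patch_tokens.shape
        height, width = image_size
        grid_h, grid_w = height // self.patch_size, width // self.patch_size
        assert num_patches == grid_h * grid_w, "token count must match the patch grid"

        logits = self.mlp(patch_tokens)                       # (B, N, 1)
        logits = logits.transpose(1, 2).reshape(batch, 1, grid_h, grid_w)
        # Upsample patch-level logits to pixel level (mode is an assumption).
        return F.interpolate(logits, size=(height, width),
                             mode="bilinear", align_corners=False)


head = PatchMLPHead()
tokens = torch.randn(2, 40 * 40, 768)   # 640 / 16 = 40 patches per side
pixel_logits = head(tokens, (640, 640))
```

With 640x640 inputs and 16x16 patches, the head sees 1600 tokens per image and returns a `(B, 1, 640, 640)` logit map.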

As pre-processing, I resized the TuSimple images to 640x640 (from the original 1280x720), along with their respective ground truths, using bilinear interpolation. For the ViT, I loaded pre-trained weights and interpolated the pre-trained positional embeddings (using the 'nearest' mode) to match my new input dimensions.
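A minimal sketch of the positional embedding resizing step, assuming the pre-trained embeddings have shape `(1, 1 + 14*14, 768)` (one class token plus a 14x14 grid, as for a ViT-B/16 trained at 224x224). The function name `resize_pos_embed` and its arguments are hypothetical. The default mode is `"nearest"` to match the description above; other modes that `torch.nn.functional.interpolate` supports, such as `"bilinear"` or `"bicubic"`, can be passed instead.

```python
import math

import torch
import torch.nn.functional as F


def resize_pos_embed(pos_embed, new_grid, num_prefix_tokens=1, mode="nearest"):
    """Resize the grid part of a ViT positional embedding to a new patch grid."""
    prefix = pos_embed[:, :num_prefix_tokens]        # e.g. the class token
    grid = pos_embed[:, num_prefix_tokens:]          # (1, old*old, D)
    old = int(math.sqrt(grid.shape[1]))
    assert old * old == grid.shape[1], "expected a square patch grid"
    dim = grid.shape[2]

    # (1, old*old, D) -> (1, D, old, old) so it can be treated as an image
    grid = grid.reshape(1, old, old, dim).permute(0, 3, 1, 2)
    # align_corners is only accepted by the interpolating modes
    kwargs = {} if mode == "nearest" else {"align_corners": False}
    grid = F.interpolate(grid, size=new_grid, mode=mode, **kwargs)
    # back to (1, new_h*new_w, D)
    grid = grid.permute(0, 2, 3, 1).reshape(1, new_grid[0] * new_grid[1], dim)
    return torch.cat([prefix, grid], dim=1)


pretrained = torch.randn(1, 1 + 14 * 14, 768)        # stand-in for real weights
resized = resize_pos_embed(pretrained, (40, 40))     # 640 / 16 = 40
```

The class token embedding is passed through unchanged; only the 2D grid is resized.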

To my understanding, I should first fine-tune the pipeline by training only the patch classification head to a reasonable performance, and then, if needed, unfreeze some transformer layers and fine-tune the pre-trained ViT as well. However, I am struggling to reach even a 0.01 F1 or IoU score after 10-20 epochs of training the classification head with a learning rate of 0.1 (as a first experiment, I tried to overfit the model to check that I am on the right track).
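The frozen-backbone overfitting experiment described above can be sketched as follows. This is only an illustration under stated assumptions: the `backbone` and `head` here are small hypothetical stand-ins (single linear layers on random data), not the real ViT or MLP, and plain SGD is assumed as the optimizer. Only the learning rate of 0.1 comes from the description.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


def freeze(module: nn.Module) -> None:
    """Disable gradients for every parameter and switch to eval mode."""
    for param in module.parameters():
        param.requires_grad = False
    module.eval()


# Hypothetical stand-ins for the pre-trained ViT and the classification head.
embed_dim = 32
backbone = nn.Linear(embed_dim, embed_dim)
head = nn.Linear(embed_dim, 1)

tokens = torch.randn(16, embed_dim)                 # one small, fixed batch
labels = (torch.rand(16, 1) > 0.5).float()          # random binary targets

freeze(backbone)
backbone_before = backbone.weight.detach().clone()

# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.SGD(head.parameters(), lr=0.1)
criterion = nn.BCEWithLogitsLoss()

losses = []
for step in range(50):
    with torch.no_grad():                           # backbone is frozen
        features = backbone(tokens)
    loss = criterion(head(features), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

Repeating the same batch is the overfitting check: if the loss on that batch does not go down, the problem is more likely in the pipeline than in the amount of training.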

What could I be doing wrong?

Thanks in advance

p.s. Don't hesitate to ask me for any code.
ps2. I am using the BCEWithLogitsLoss() loss function with pos_weight = 1900 for the lane class (computed as the total number of background pixels divided by the number of lane pixels).
ps3. Pre-trained weights from
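
For reference, the `pos_weight` computation from ps2 can be sketched like this. The helper name `lane_pos_weight` and the tiny toy masks are hypothetical; on the real dataset the ratio is the 1900 quoted above.

```python
import torch
import torch.nn as nn


def lane_pos_weight(masks: torch.Tensor) -> torch.Tensor:
    """Ratio of background pixels to lane pixels over binary masks (1 = lane)."""
    positives = masks.sum()
    negatives = masks.numel() - positives
    # clamp avoids division by zero if no lane pixel is present
    return (negatives / positives.clamp(min=1)).reshape(1)


# Toy ground truth: 2 lane pixels out of 128 in total.
masks = torch.zeros(2, 1, 8, 8)
masks[:, :, 0, 0] = 1.0

pos_weight = lane_pos_weight(masks)                 # 126 / 2 = 63
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.zeros(2, 1, 8, 8)                    # untrained "predictions"
loss = criterion(logits, masks)
```

`pos_weight` multiplies only the loss term of the positive (lane) targets, so each lane pixel counts as much as `pos_weight` background pixels.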

[1] Segmenter: Transformer for Semantic Segmentation, arXiv:2105.05633