Am I using Huggingface correctly

Hello,

I am working on a project to see how some models/architectures perform on my custom dataset for semantic segmentation. I want to train my models from scratch with no pre-trained weights. I am comparing models like ResNet50, Segformer and Mask2Former. I load all my images and masks using a DataLoader. For ResNet50, I use the FCN-ResNet50 provided by torchvision: fcn_resnet50 — Torchvision main documentation.
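(For reference, this is roughly how I build the torchvision baseline; note that weights_backbone also has to be None, otherwise the ResNet50 backbone still gets ImageNet weights.)

from torchvision.models.segmentation import fcn_resnet50

# Fully randomly initialized FCN-ResNet50: no pretrained head, no pretrained backbone
model = fcn_resnet50(weights=None, weights_backbone=None, num_classes=data['num_classes'])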

For Segformer, I found that Hugging Face provides a Segformer model, so I am just using that. The performance isn't that great, so I am wondering if I am doing something wrong. Is it fine to use Hugging Face models without any of the other Hugging Face utilities like AutoImageProcessor?
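(To be concrete about what I am skipping: my understanding is that the image processor mostly does resizing, rescaling and ImageNet normalization and can also prepare the label map, which my DataLoader already handles. Since I am not loading a checkpoint, I believe I would have to instantiate it directly rather than via AutoImageProcessor.from_pretrained. A sketch of what I mean, where raw_image and raw_mask are hypothetical unprocessed inputs:)

from transformers import SegformerImageProcessor

# Default Segformer preprocessing: resize to 512x512, rescale to [0, 1], ImageNet normalization
processor = SegformerImageProcessor()

encoded = processor(images=raw_image, segmentation_maps=raw_mask, return_tensors="pt")
pixel_values = encoded["pixel_values"]   # shape (1, 3, 512, 512)
labels = encoded["labels"]               # shape (1, 512, 512)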
I load the model using:

from transformers import SegformerConfig, SegformerForSemanticSegmentation

configuration = SegformerConfig(**dict(arch['args']))
configuration.num_labels = data['num_classes']
# Building the model from a config (instead of from_pretrained) gives randomly initialized weights
model = SegformerForSemanticSegmentation(configuration)

Then I get the results using:

for _, images, masks in dataloader:
    images = images.to(self.device, non_blocking=True)
    masks = masks.to(self.device, non_blocking=True)
    outputs = model(pixel_values=images, labels=masks).logits
    outputs = nn.functional.interpolate(outputs, size=masks.shape[-2:], mode="bilinear", align_corners=False)
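(To spell out how I understand the outputs: passing labels=masks makes the model compute a cross-entropy loss internally, available as outputs.loss, and the logits come out at 1/4 of the input resolution, which is why I upsample before taking the argmax. Simplified sketch of the rest of the step; optimizer is just a placeholder for whatever optimizer I configure:)

out = model(pixel_values=images, labels=masks)
loss = out.loss                      # cross-entropy computed by the model from the labels

optimizer.zero_grad()
loss.backward()
optimizer.step()

# For metrics: upsample the 1/4-resolution logits to the mask size and take the per-pixel argmax
upsampled = nn.functional.interpolate(out.logits, size=masks.shape[-2:], mode="bilinear", align_corners=False)
preds = upsampled.argmax(dim=1)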

I am also trying to train Mask2Former from scratch, but for post-processing it seems I need to use Mask2FormerImageProcessor to get the semantic segmentation. I have already processed my images in my DataLoader. What do I do here to just use Mask2Former with my own data?
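(From reading the docs, my understanding is that the image processor is normally also the piece that turns a semantic mask into the per-image mask_labels and class_labels that Mask2Former's loss expects, roughly like the sketch below. I am not sure how that is supposed to fit in when the DataLoader already returns processed tensors; images_np and masks_np are hypothetical unprocessed arrays here.)

from transformers import Mask2FormerImageProcessor

# Resizing/rescaling/normalization disabled, since my DataLoader already does all of that
processor = Mask2FormerImageProcessor(do_resize=False, do_rescale=False, do_normalize=False)

inputs = processor(images=images_np, segmentation_maps=masks_np, return_tensors="pt")

# inputs holds pixel_values plus per-image lists of binary mask_labels and their class_labels
outputs = model(
    pixel_values=inputs["pixel_values"].to(self.device),
    mask_labels=[m.to(self.device) for m in inputs["mask_labels"]],
    class_labels=[c.to(self.device) for c in inputs["class_labels"]],
)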

So for Mask2Former, here is what I am doing:

configuration = Mask2FormerConfig(**dict(arch['args']))
configuration.num_queries = data['num_classes']
model = Mask2FormerForUniversalSegmentation(configuration)

for _, images, masks in dataloader:
    images = images.to(self.device, non_blocking=True)
    masks = masks.to(self.device, non_blocking=True)
    outputs = model(images, mask_labels=masks)
    outputs = outputs.masks_queries_logits
    outputs = nn.functional.interpolate(outputs, size=masks.shape[-2:], mode="bilinear", align_corners=False)
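(For turning the model-only outputs into a semantic map, my understanding from reading the transformers source is that Mask2FormerImageProcessor.post_process_semantic_segmentation essentially combines the class and mask logits like this, so it should be possible to do it by hand. Sketch only, using the same variables as the loop above, not verified:)

out = model(pixel_values=images)   # forward pass without labels, for evaluation

class_probs = out.class_queries_logits.softmax(dim=-1)[..., :-1]   # drop the "no object" class
mask_probs = out.masks_queries_logits.sigmoid()                    # (batch, num_queries, h/4, w/4)

# Weight every query's mask by its class scores -> per-class maps of shape (batch, num_labels, h/4, w/4)
semantic_logits = torch.einsum("bqc,bqhw->bchw", class_probs, mask_probs)
semantic_logits = nn.functional.interpolate(semantic_logits, size=masks.shape[-2:], mode="bilinear", align_corners=False)
preds = semantic_logits.argmax(dim=1)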

Is this the correct way to use Mask2Former (model only) for semantic segmentation? It's not performing that well…

Thanks.

You might want to post this question in the HuggingFace forum as it seems to be quite HF-specific.