I am currently working with the torchvision.models.segmentation.deeplabv3_resnet50() model.
It consists of:
- a backbone (ResNet)
- a classifier (DeepLabHead)
- interpolation (bilinear, to make sure output_size = input_size)
What really confused me was the interpolation part.
For testing, I passed in an image of size 270x512.
The output of the classifier, however, was only 34x64.
So the model uses bilinear interpolation to upscale from 34x64 to 270x512 (roughly a factor of 8 in each dimension), which seems like a massive jump.
I always thought DeepLab's decoder would upscale the feature map to something at least close to the original size.
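
To sanity-check where 34x64 comes from, here is a small sketch of the size arithmetic. It assumes the backbone downsamples three times by a stride of 2 (stem conv, max pool, first strided block) and that the last two ResNet stages are dilated instead of strided, which I believe is how torchvision builds this model. The helper names `conv_out` and `backbone_out` are made up for illustration.

```python
def conv_out(size, kernel, stride, padding, dilation=1):
    """Output length of a conv/pool layer along one axis (floor formula used by PyTorch)."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def backbone_out(size):
    """Spatial size after the backbone, under the assumptions stated above."""
    size = conv_out(size, kernel=7, stride=2, padding=3)  # stem 7x7 conv, stride 2
    size = conv_out(size, kernel=3, stride=2, padding=1)  # 3x3 max pool, stride 2
    size = conv_out(size, kernel=3, stride=2, padding=1)  # strided 3x3 conv in layer2
    # layer3 and layer4 are assumed to be dilated (stride 1), so no further reduction
    return size


feature_hw = (backbone_out(270), backbone_out(512))
print(feature_hw)  # (34, 64)
```

So under these assumptions the 34x64 map is simply the input at an output stride of 8, and the final interpolation has to make up that factor of 8.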
Why does this model still perform so well? Would it be possible to replace the interpolation with something like conv_transpose2d to improve the result?
Or maybe I am just not understanding the decoder part?
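
To make the question concrete, this is roughly what I have in mind. It is only a sketch on random tensors, not the torchvision code: `LearnedUpsample` is a hypothetical module, 21 classes is an assumption, and I am not claiming it gives better results than plain bilinear interpolation. Note that a transposed convolution with stride 8 turns 34x64 into 272x512, so a final resize (or crop) to the exact input size is still needed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LearnedUpsample(nn.Module):
    """Hypothetical replacement for the final bilinear step (illustration only)."""

    def __init__(self, num_classes, factor=8):
        super().__init__()
        # kernel = 2 * factor, padding = factor // 2 gives an output of exactly factor * input
        self.up = nn.ConvTranspose2d(
            num_classes, num_classes,
            kernel_size=2 * factor, stride=factor, padding=factor // 2,
        )

    def forward(self, logits, out_size):
        x = self.up(logits)  # 34x64 -> 272x512 for factor=8
        # 272 != 270, so resize to the exact input size at the end
        return F.interpolate(x, size=out_size, mode="bilinear", align_corners=False)


logits = torch.randn(1, 21, 34, 64)  # stand-in for the classifier output
input_size = (270, 512)

# Baseline: a single bilinear resize straight to the input size
baseline = F.interpolate(logits, size=input_size, mode="bilinear", align_corners=False)

# Alternative: learned upsampling followed by a small corrective resize
learned = LearnedUpsample(num_classes=21)(logits, input_size)

print(baseline.shape, learned.shape)
```

Both paths end up at 1x21x270x512; the difference is that the second one has trainable weights in the upsampling step and would need to be trained or fine-tuned.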