Understanding PyTorch's DeepLabV3

mayool · June 4, 2020, 10:37am

Hey everyone,

I am currently working with the torchvision.models.segmentation.deeplabv3_resnet50() model.

It consists of:

a backbone (ResNet)
a classifier (DeepLabHead)
interpolation (bilinear, to make sure output_size = input_size)

What really confused me was the interpolation part.
For testing, I passed in an image of size 270x512.
The output of the classifier, however, was 34x64.

So the model uses bilinear interpolation to upscale from 34x64 to 270x512, which seems like a massive jump.
I always thought DeepLab's decoder would upscale the image to something that is at least close to the original size.
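
For reference, the 34x64 size can be reproduced with plain arithmetic. The sketch below assumes the layer parameters of torchvision's ResNet-50 as used by deeplabv3_resnet50 (strides replaced by dilation in layer3 and layer4, so the overall output stride is 8); `conv_out` and `backbone_out` are helper names made up for this illustration.

```python
def conv_out(size, kernel, stride, padding, dilation=1):
    """Output length of a conv or pooling layer along one dimension."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


def backbone_out(size):
    """Spatial size after the three stride-2 stages of the dilated ResNet-50."""
    size = conv_out(size, kernel=7, stride=2, padding=3)  # conv1
    size = conv_out(size, kernel=3, stride=2, padding=1)  # maxpool
    size = conv_out(size, kernel=3, stride=2, padding=1)  # first block of layer2
    # layer1 has stride 1, and layer3/layer4 use dilation instead of stride,
    # so none of them changes the spatial size.
    return size


print(backbone_out(270), backbone_out(512))  # 34 64
```

So the height goes 270 -> 135 -> 68 -> 34 and the width goes 512 -> 256 -> 128 -> 64, which is roughly a factor of 8 in each dimension.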

Why does this model still perform so well? Would it be possible to replace the interpolation with something like conv_transpose2d to improve the result?

Or maybe I am just not understanding the decoder part?
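
As a rough sketch of what the transposed-convolution variant could look like: the layer below is not part of torchvision's model, and the kernel size, stride and padding are just one choice that gives an exact 8x upsampling. Whether it actually improves the result is an empirical question, since the new layer would have to be trained.

```python
import torch
import torch.nn as nn

num_classes = 21  # assumed here; matches torchvision's default head
scores = torch.rand(1, num_classes, 34, 64)  # stand-in for the classifier output

# With kernel_size=16, stride=8, padding=4 the output length is
# (in - 1) * 8 - 2 * 4 + 16 = 8 * in, i.e. an exact 8x upsampling.
upsample = nn.ConvTranspose2d(
    num_classes, num_classes, kernel_size=16, stride=8, padding=4, bias=False
)

with torch.no_grad():
    out = upsample(scores)

print(out.shape)  # torch.Size([1, 21, 272, 512])

# 34 * 8 = 272 overshoots the 270-pixel input height, so the result still
# has to be cropped (or resized) to match the input exactly.
out = out[..., :270, :512]
print(out.shape)  # torch.Size([1, 21, 270, 512])
```

Note that a fixed-factor layer like this cannot hit arbitrary input sizes on its own, which is one practical reason to interpolate to an explicit target size instead. It also adds 21 * 21 * 16 * 16 = 112,896 learnable weights.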

ptrblck · June 5, 2020, 6:57am

The paper explains the interpolation strategy as well as the usage of transposed convolutions in a couple of sections.
This section might be interesting:

We have adopted instead a hybrid approach that strikes a good efficiency/accuracy trade-off, using atrous convolution to increase by a factor of 4 the density of computed feature maps, followed by fast bilinear interpolation by an additional factor of 8 to recover feature maps at the original image resolution. Bilinear interpolation is sufficient in this setting because the class score maps (corresponding to log-probabilities) are quite smooth, as illustrated in Fig. 5.
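
To get a feeling for the smoothness argument, here is a small synthetic experiment. It is only an illustration under stated assumptions: `smooth_map` is a made-up smooth function standing in for a class score map, not real network output, and `align_corners=True` is used so that the coarse and dense grids line up exactly (the torchvision model itself uses `align_corners=False`).

```python
import math

import torch
import torch.nn.functional as F


def smooth_map(height, width):
    """Sample a smooth synthetic function on a height x width grid over [0, 1]^2."""
    ys = torch.linspace(0.0, 1.0, height).unsqueeze(1)
    xs = torch.linspace(0.0, 1.0, width).unsqueeze(0)
    return torch.sin(2 * math.pi * ys) * torch.cos(2 * math.pi * xs)


dense = smooth_map(270, 512)             # reference at full resolution
coarse = smooth_map(34, 64)[None, None]  # same function at classifier resolution

# Bilinear upsampling from 34x64 back to 270x512.
upsampled = F.interpolate(
    coarse, size=(270, 512), mode="bilinear", align_corners=True
)[0, 0]

max_error = (upsampled - dense).abs().max().item()
print(f"max abs error: {max_error:.4f}")
```

For this function, whose values span [-1, 1], the maximum error stays below 0.01, so almost nothing is lost by computing it at roughly 1/8 of the resolution. Sharp, high-frequency maps would not behave this way.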