ResNet as backbone/feature extractor - undesired output dimensions

Hi, I have been attempting to use a pre-trained ResNet model as a feature extractor. I have removed the final fc and pooling stages of the network, and the output shape is (1, 2048, 7, 7).
These are not the features I want. What I want is one feature vector per pixel, i.e. an output of shape (1, n_features, H, W), where H and W are the height and width of the input image.
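For reference, this is roughly how I am extracting the features (a minimal sketch, assuming torchvision's resnet50):

import torch
import torchvision

model = torchvision.models.resnet50(pretrained=True)
# drop the final avgpool and fc layers, keep everything up to layer4
backbone = torch.nn.Sequential(*list(model.children())[:-2])

x = torch.randn(1, 3, 224, 224)
features = backbone(x)
print(features.shape)
> torch.Size([1, 2048, 7, 7])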

Is there any intermediate layer which outputs this kind of shape in ResNet? Is there any other pre-trained model in pytorch which offers this?

I don’t think you could use ResNets for your use case, as the first conv layer would already reduce the spatial size (layer definition here):

import torch
import torch.nn as nn

# first conv layer of torchvision's ResNet: stride=2 already halves the spatial size
conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
x = torch.randn(1, 3, 224, 224)
out = conv1(x)
print(out.shape)
> torch.Size([1, 64, 112, 112])

I would guess that the majority of CNNs downsample the spatial size in their feature extractor stage, and I don't know of a specific model that keeps the spatial size throughout.
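To illustrate, here is a quick sketch of how the spatial size shrinks after each stage of torchvision's resnet50 for a 224x224 input (shapes shown as comments):

import torch
import torchvision

# weights don't matter here, we only check the intermediate shapes
model = torchvision.models.resnet50(pretrained=False).eval()
x = torch.randn(1, 3, 224, 224)

out = model.conv1(x)      # torch.Size([1, 64, 112, 112])
out = model.bn1(out)
out = model.relu(out)
out = model.maxpool(out)  # torch.Size([1, 64, 56, 56])
out = model.layer1(out)   # torch.Size([1, 256, 56, 56])
out = model.layer2(out)   # torch.Size([1, 512, 28, 28])
out = model.layer3(out)   # torch.Size([1, 1024, 14, 14])
out = model.layer4(out)   # torch.Size([1, 2048, 7, 7])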

Thanks for your response! That is sad to hear. Do you perhaps know of any networks that have been used to upsample ResNet (or other pre-trained network) features?

I think you might find interesting model architectures when searching for segmentation models, as they often return outputs with the original input shape (e.g. U-Nets).
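As a rough sketch of two directions you could try (just an illustration; bilinear interpolation and torchvision's fcn_resnet50 are my picks here, not something specific to your setup):

import torch
import torch.nn.functional as F
import torchvision

# Option 1: bilinearly upsample the extracted backbone features back to the input resolution
features = torch.randn(1, 2048, 7, 7)  # stand-in for the backbone output from above
upsampled = F.interpolate(features, size=(224, 224), mode='bilinear', align_corners=False)
print(upsampled.shape)
> torch.Size([1, 2048, 224, 224])

# Option 2: use a segmentation model with a ResNet backbone, which already
# returns one prediction per input pixel
seg_model = torchvision.models.segmentation.fcn_resnet50(pretrained=True).eval()
x = torch.randn(1, 3, 224, 224)
out = seg_model(x)['out']
print(out.shape)
> torch.Size([1, 21, 224, 224])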