How to change Spatial Transformer Network to support different image sizes

Following the tutorial:

What would I need to do to adapt the STN to work with a specific B=batch size, H=image height, W=image width, C=channels? I assume B is handled by .view(-1, …), but what about the remaining H, W, and C values?

Thank you and my apologies if this is a bit of a beginner question! :slight_smile:

Do you mean the view() ops in stn() and forward()?
If you change the number of input channels or the height and width of your input, you would need to adapt the in_features of the linear layers as well, or alternatively use adaptive pooling layers to get a fixed output size regardless of the input size.
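A minimal sketch of the adaptive pooling option, assuming the localization block has the same layers as in the tutorial (two convolutions with kernel sizes 7 and 5, each followed by 2x2 max pooling). The helper make_localization and its in_channels argument are hypothetical names used only for illustration:

```python
import torch
import torch.nn as nn


def make_localization(in_channels: int = 1) -> nn.Sequential:
    """Localization block with a fixed [B, 10, 3, 3] output (hypothetical helper)."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 8, kernel_size=7),  # C is set here
        nn.MaxPool2d(2, stride=2),
        nn.ReLU(True),
        nn.Conv2d(8, 10, kernel_size=5),
        nn.MaxPool2d(2, stride=2),
        nn.ReLU(True),
        # Assumption: pooling to 3x3 keeps the original 10 * 3 * 3 in_features.
        nn.AdaptiveAvgPool2d((3, 3)),
    )


loc = make_localization(in_channels=3)
for size in (28, 64, 224):
    out = loc(torch.zeros(2, 3, size, size))
    print(size, tuple(out.shape))  # spatial size is always 3x3
```

With this, the view(-1, 10 * 3 * 3) call in stn() and the in_features of self.fc_loc can stay unchanged. The classification branch in forward() has its own flatten and linear layer, so it would need the same treatment.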

As you can see, self.localization returns an output of shape [batch_size, 10, 3, 3]. Flattened, that is 10 * 3 * 3 = 90 values per sample, which is exactly the in_features of self.fc_loc.

The easiest way would be to add print statements to your forward method, pass in an input with your new image shape, and print the shape returned by the layers just before the linear layers.
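Instead of editing forward, the same check can be done with a dummy input. This is only a sketch: it assumes the localization layers match the tutorial, and flattened_features is a hypothetical helper name:

```python
import torch
import torch.nn as nn

# Assumed to match the tutorial's localization block.
localization = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=7),
    nn.MaxPool2d(2, stride=2),
    nn.ReLU(True),
    nn.Conv2d(8, 10, kernel_size=5),
    nn.MaxPool2d(2, stride=2),
    nn.ReLU(True),
)


def flattened_features(module: nn.Module, channels: int, height: int, width: int) -> int:
    """Run a dummy batch of one image through module and count features per sample."""
    with torch.no_grad():
        out = module(torch.zeros(1, channels, height, width))
    print("output shape:", tuple(out.shape))
    return out[0].numel()


n_features = flattened_features(localization, channels=1, height=28, width=28)

# Use the measured value for the first linear layer of the regressor.
fc_loc = nn.Sequential(
    nn.Linear(n_features, 32),
    nn.ReLU(True),
    nn.Linear(32, 3 * 2),
)
```

The same number would also replace 10 * 3 * 3 in the view() call that flattens the localization output.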

I was reading that tutorial and came across your comment. I can’t understand why self.localization returns 10*3*3 features and not 10 features (like the convolution layer before it)?

I’m not sure if I’m missing something, but self.localization seems to be the first block called in the model.

Thanks for your response.
I actually had an error when I tested this network on a 224x224 image, which became a 52x52 feature map after the convolutions and the two max-pooling layers. But I fixed it by replacing 10*3*3 with 10*52*52.
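For reference, both the 3x3 and the 52x52 sizes follow from the standard output-size formula for convolution and pooling layers. A small sketch of the arithmetic, assuming the tutorial's layers (7x7 conv, 2x2 max pool, 5x5 conv, 2x2 max pool, no padding); the function names are hypothetical:

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length of a conv or pooling layer along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def localization_hw(size: int) -> int:
    """Spatial size after the assumed localization layers."""
    size = conv_out(size, kernel=7)            # Conv2d(kernel_size=7)
    size = conv_out(size, kernel=2, stride=2)  # MaxPool2d(2, stride=2)
    size = conv_out(size, kernel=5)            # Conv2d(kernel_size=5)
    size = conv_out(size, kernel=2, stride=2)  # MaxPool2d(2, stride=2)
    return size


for hw in (28, 224):
    out = localization_hw(hw)
    print(f"{hw}x{hw} -> {out}x{out}, in_features = {10 * out * out}")
# 28x28 -> 3x3, in_features = 90
# 224x224 -> 52x52, in_features = 27040
```

The 10 is only the channel count of the last convolution; the linear layer sees every spatial position of every channel, hence 10 * 3 * 3 rather than 10.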