How to create convnet for variable size input dimension images

Hey guys, I was wondering if it’s possible to create a convnet with variable-size input images as training data. If so, can someone provide a simple example of how to achieve something like that? For instance, in TensorFlow I would simply define the input shape on the first conv layer as (None, None, 3). Can we do something like that in PyTorch?



it depends on what you want to do at the end.

If the convnet is for image classification (you want one output for an image, regardless of size), then you can use nn.AdaptiveAvgPool2d right before the fully connected layers.
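A minimal sketch of that first option (the layer sizes and class count here are illustrative, not from the thread): nn.AdaptiveAvgPool2d squeezes any spatial size down to a fixed grid, so the nn.Linear layer that follows always sees the same number of features.

```python
import torch
import torch.nn as nn

class VarSizeClassifier(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Pool any H x W feature map down to a fixed 4 x 4 grid.
        self.pool = nn.AdaptiveAvgPool2d((4, 4))
        self.fc = nn.Linear(32 * 4 * 4, num_classes)

    def forward(self, x):
        x = self.pool(self.features(x))
        return self.fc(x.flatten(1))

net = VarSizeClassifier()
# Two inputs with different spatial sizes both yield (1, 10) logits.
print(net(torch.randn(1, 3, 238, 126)).shape)  # torch.Size([1, 10])
print(net(torch.randn(1, 3, 68, 234)).shape)   # torch.Size([1, 10])
```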

If you want dense classification (larger image = larger output), you can replace the nn.Linear layers at the end with nn.Conv2d layers with a 1x1 kernel.
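And a sketch of the second, fully convolutional option (again, layer sizes are made up): the final 1x1 conv keeps the spatial dimensions of its input, so a larger image simply produces a larger map of class scores.

```python
import torch
import torch.nn as nn

num_classes = 10
fcn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    # 1x1 conv acts like a per-location linear layer over the channels.
    nn.Conv2d(16, num_classes, kernel_size=1),
)

out = fcn(torch.randn(1, 3, 64, 48))
print(out.shape)  # torch.Size([1, 10, 64, 48]) -- one score map per class
```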


Hi Soumith, thanks a lot for the answer. Ultimately the goal is to do classification on image ROIs, and since they are of different sizes, that’s why I was asking. So, for instance, if we have a training loop like this:

for epoch in range(num_epochs):
    for sample in samples:
        outputs = convnet(sample)

where each sample represents a batch and is actually a list of images [img_1, img_2, ..., img_n], and img_1, img_2, ..., img_n represent ROIs extracted from the original images.

Would that work, or do I need to specify something beforehand in the convnet architecture to make it work with this kind of data?

I’m not quite sure about the benefits of nn.AdaptiveAvgPool2d in this case. Would you mind elaborating a little? How could it help the convnet deal with data where each sample has a variable size?

For instance:
img1.shape = (238, 126, 3)
img2.shape = (68, 234, 3)
img3.shape = (225, 98, 3)
...
img_n.shape = ...


AdaptiveAvgPool2d takes a variable-sized image as input and downsamples it to a fixed size. It does this by adaptively changing the pooling filter size based on the size of the input image.
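To make this concrete, here is a small demonstration using the example shapes from earlier in the thread (the 7x7 target size is just an illustrative choice): whatever the input resolution, the output comes back at the same fixed size.

```python
import torch
import torch.nn as nn

# Always produce a 7 x 7 output, regardless of the input's H and W.
pool = nn.AdaptiveAvgPool2d((7, 7))

for h, w in [(238, 126), (68, 234), (225, 98)]:
    x = torch.randn(1, 32, h, w)
    print(pool(x).shape)  # torch.Size([1, 32, 7, 7]) every time
```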


Awesome. Thanks, much appreciated! 🙂

@kirk86 Can you share the key code that loads different-size input images?

I read the source code of DataLoader and found that torch.stack() is used. This function expects all elements in the batch sequence to have exactly the same size. How do you handle that?

If I set the batch size to greater than 1, how can I use a DataLoader with different-size inputs?
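One way around the torch.stack restriction is to pass a custom collate_fn to the DataLoader that returns the batch as a plain list instead of a stacked tensor. A sketch with a synthetic dataset (the dataset class and shapes here are made up for illustration):

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ROIDataset(Dataset):
    """Toy dataset whose items have different spatial sizes."""
    def __init__(self):
        self.shapes = [(238, 126), (68, 234), (225, 98), (100, 100)]

    def __len__(self):
        return len(self.shapes)

    def __getitem__(self, i):
        h, w = self.shapes[i]
        return torch.randn(3, h, w)

# collate_fn=lambda batch: batch keeps the samples as a list,
# so no torch.stack is attempted on mismatched shapes.
loader = DataLoader(ROIDataset(), batch_size=2,
                    collate_fn=lambda batch: batch)

for batch in loader:
    # batch is a list of tensors; feed them to the model one at a time.
    print([tuple(t.shape) for t in batch])
```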



Same problem here. It would be much more convenient if we could do something like torch.stack with tensors of different sizes.

As a workaround, for batch size 1, you can manually accumulate the loss, average it, then call backward and update the weights.
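The workaround can be sketched like this (model, optimizer, and data below are placeholders): each image goes through the model on its own, the per-image losses are averaged, and a single backward/step updates the weights as if it were one batch.

```python
import torch
import torch.nn as nn

# Tiny stand-in model that tolerates variable input sizes.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# A "virtual batch" of differently sized images, each with batch dim 1.
images = [torch.randn(1, 3, 238, 126), torch.randn(1, 3, 68, 234)]
targets = [torch.tensor([2]), torch.tensor([5])]

optimizer.zero_grad()
loss = sum(criterion(model(img), tgt) for img, tgt in zip(images, targets))
loss = loss / len(images)   # average over the virtual batch
loss.backward()             # one backward pass for the accumulated loss
optimizer.step()
```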


Thanks for the workaround, Jiedong. Is there a more optimized method for GPU utilization? Calling .cuda() on each data example tensor is not improving training performance over the CPU alone.

I might be missing something, but how does a kernel size of 1x1 solve this problem? How do we go from the 1x1 convolution of the preceding feature maps to a classification?

The number of kernels in the last conv layer (out_channels) specifies the number of classes, while the 1x1 kernel does not change the spatial size.
Using this approach you can easily output the class logits for each pixel location, e.g. in a segmentation use case.

Thanks! But if we were trying to do simple classification of the entire image, then we would need some more logic here. In particular, if we were trying to do NLL against, say, 10 classes, then one approach would be to average the values in each of the 10 feature maps of the last layer. I suppose this would be the “obvious” spatial-dimension-invariant simple classifier.
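That averaging idea can be written in a couple of lines (the channel counts here are illustrative): the 1x1 conv produces one feature map per class, and taking the mean over the spatial dimensions collapses each map to a single logit, independent of input size.

```python
import torch
import torch.nn as nn

num_classes = 10
# 1x1 conv: one output feature map per class.
head = nn.Conv2d(32, num_classes, kernel_size=1)

feats = torch.randn(1, 32, 30, 17)      # feature maps of arbitrary size
logits = head(feats).mean(dim=(2, 3))   # average each class map spatially
print(logits.shape)  # torch.Size([1, 10])
```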

Hi smth! What should I do if the spatial size of the input to the convolutional layers is the same but the depth (channel) size of the input is variable? Does nn.AdaptiveAvgPool2d still work? Thanks!