Pass variable-sized images as input to a model without resizing

I’ve searched on Google and in the suggested threads here for an answer to this but couldn’t find one.
So, let’s say I have these 4 images as input:

########################
### Images / Dataset ###
import torch
import torch.nn as nn
from torch.utils.data import Dataset

# Four images with different spatial sizes (C, H, W)
Image1 = torch.rand((3, 255, 255))
Image2 = torch.rand((3, 320, 320))
Image3 = torch.rand((3, 320, 320))
Image4 = torch.rand((3, 120, 120))

I tried to create a simple dataset and data loader for this:

# Custom collate_fn that simply returns the batch as a Python list of tensors,
# since images of different sizes cannot be stacked into a single tensor.
def VariedSizedImagesCollate(batch):
    return [item for item in batch]


class Images_X_Dataset(Dataset):
    def __init__(self, ListOfImages):
        self.data = ListOfImages

    def __getitem__(self, index):
        return self.data[index]
    
    def __len__(self):
        return len(self.data)


MyDataset = Images_X_Dataset([Image1, Image2, Image3, Image4])
MyDataLoader = torch.utils.data.DataLoader(dataset = MyDataset, batch_size = 4, shuffle = True, collate_fn = VariedSizedImagesCollate, pin_memory = True)

We can take one batch and put it in the variable MyX:
MyX = next(iter(MyDataLoader))
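
As a quick sanity check (not in the original post), the batch really is a plain Python list of tensors with different shapes:

print(type(MyX))        # <class 'list'>
for img in MyX:
    print(img.shape)    # e.g. torch.Size([3, 320, 320]), torch.Size([3, 120, 120]), ... (order varies with shuffle)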

Now we have a simple fully convolutional network (so that the network itself can handle different-sized pictures without any tricks).

#############################################
### Model to work with Varied Size Images ###
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        
        self.Conv1 = nn.Conv2d(in_channels = 3, out_channels = 32, kernel_size = (7, 7))
        self.Conv2 = nn.Conv2d(in_channels = 32, out_channels = 64, kernel_size = (5, 5))
        self.Decv1 = nn.ConvTranspose2d(in_channels = 64, out_channels = 1, kernel_size = (4, 4))
        self.Sigm1 = nn.Sigmoid()

    def forward(self, X):
        out = self.Conv1(X)
        out = self.Conv2(out)
        out = self.Decv1(out)
        out = self.Sigm1(out)
        return out

model = Net()

Now if I instantiate the model and pass the 1st image of MyX, I will get an output as I should.

MyOutImg = model(MyX[0].unsqueeze(0))
print("Original shape of 1st image:", MyX[0].shape, "Output's 1st image shape: ", MyOutImg[0].shape)

Original shape of 1st image: torch.Size([3, 120, 120]) Output’s 1st image shape: torch.Size([1, 113, 113])

The problem arises when I want to pass the whole batch:

MyOutImg = model(MyX)
print("Original shape of 1st image:", MyX[0].shape, "Output's 1st image shape: ", MyOutImg[0].shape)

TypeError: conv2d(): argument ‘input’ (position 1) must be Tensor, not list

Which I understand. As far as I understand, it can’t be a tensor because the pictures have different sizes, and a tensor requires all N pictures to have the same size (N x C x H x W), so VariedSizedImagesCollate() returns a list instead.

I tried torch.cat and torch.stack, but they both seem to require same-sized images. So what is a way I can pass the whole batch?
I also need it to work for backpropagation, but I guess if it works in the forward pass it should work in the backward pass too.

You can pad the images to the same size if you truly want to avoid resizing.
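
For example, here is a minimal sketch of a collate function that zero-pads every image in a batch up to the largest height and width in that batch (the name PadToLargestCollate and the zero-padding choice are just illustrative, not from this thread):

import torch
import torch.nn.functional as F

def PadToLargestCollate(batch):
    # batch is a list of (C, H, W) tensors with possibly different H and W
    max_h = max(img.shape[1] for img in batch)
    max_w = max(img.shape[2] for img in batch)
    padded = []
    for img in batch:
        pad_h = max_h - img.shape[1]
        pad_w = max_w - img.shape[2]
        # F.pad pads the last two dims with (left, right, top, bottom)
        padded.append(F.pad(img, (0, pad_w, 0, pad_h)))
    return torch.stack(padded)  # shape: (N, C, max_h, max_w)

With this as collate_fn, the DataLoader returns an ordinary 4D tensor, at the cost of the smaller images carrying zero-padding.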

There are two fundamental reasons why variable-sized images cannot be combined in a batch:
(1) When trying to stack or concat tensors whose spatial dimensions are not the same, what should the output tensor shape be? How can the spatial dimensions be unified when there is only a single output tensor?

(2) Even suspending the first issue for a moment and considering the possibility that a tensor could exist with different spatial dimensions for different batch indices, this would wreak havoc in terms of batch-level parallelization. It fundamentally introduces data-dependent control flow, which means algorithms that are written to repeat the same computation (e.g., sliding a filter across multiple input examples) have to be modified to consider the potentially different dimensions of each input.

For the 1st point, yeah, I can imagine.
But I’ve seen papers that use different-sized images as input; they explicitly say that they don’t resize, but they don’t mention how they handle the different sizes in the end.

So, since I can use this network with different sizes one by one, just not in a batch, I imagined there might be a way to make a list-like kind of batch instead of a torch tensor.
You know what I mean?
It’s certainly impossible with tensors, but perhaps there’s another way?

On the 2nd point, for this list-like batch, I don’t think there would be any fundamental change, since predicting with different-sized images on this network already works as long as you do it one picture at a time. Instead of iterating over the 1st dimension of an N x C x H x W tensor, you would iterate over the length of the list, whose elements are C x H_i x W_i tensors.

(Now if a picture is too small for the current architecture, ofc that will throw an error)

The issue is that the underlying implementations on CPU and GPU are usually written to take advantage of batching for data reuse. Of course, you can always write your own function that takes in a list of tensors, one per image. The performance will likely be much lower than with batching (even if batching means zero-padding is used).
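
A minimal sketch of that per-image loop, accumulating the loss so backpropagation still works (the loss, optimizer, and random targets here are placeholders, not something from this thread):

# Assumes `model` and `MyX` (a list of (C, H, W) tensors) from above.
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

optimizer.zero_grad()
total_loss = 0.0
for img in MyX:
    out = model(img.unsqueeze(0))      # forward one image at a time
    target = torch.rand_like(out)      # placeholder target with the same shape as out
    total_loss = total_loss + criterion(out, target)
total_loss = total_loss / len(MyX)     # average over the list-like "batch"
total_loss.backward()                  # gradients accumulate across all images
optimizer.step()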

I am referring to changes in the underlying kernel implementations. These kernels are highly shape-dependent (e.g., the best axis for data reuse changes depending on the relative sizes), and batch size is an important parameter in that.

I see.

Well, in that case, I guess the best I can do is abandon hope for a list of 3D tensors and pad the images to create one actual 4D tensor.

Perhaps I could sort by height and width, so that each batch only needs as little padding as necessary to match the size of the biggest picture in that batch.
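
A rough sketch of that bucketing idea, reusing the illustrative PadToLargestCollate from earlier (the sorting key and batch size are arbitrary choices for demonstration):

# Sort images by area so that each batch groups similarly-sized images,
# then pad only up to the largest image within each batch.
SortedImages = sorted([Image1, Image2, Image3, Image4],
                      key=lambda img: img.shape[1] * img.shape[2])

BucketedDataset = Images_X_Dataset(SortedImages)
BucketedLoader = torch.utils.data.DataLoader(dataset=BucketedDataset,
                                             batch_size=2,   # smaller batches keep padding low
                                             shuffle=False,  # keep the size ordering
                                             collate_fn=PadToLargestCollate)

for batch in BucketedLoader:
    print(batch.shape)  # torch.Size([2, 3, 255, 255]) then torch.Size([2, 3, 320, 320])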