FCN for various image sizes not learning


First of all, please excuse my English; I am trying my best.

My aim is to create a fully convolutional net for binary classification which can handle input images of different sizes. At the moment the code runs without errors, but the network is not learning. The accuracy stays at around 50% and the loss barely changes.

I took the following measures to be able to use different image sizes:

  • Since a DataLoader cannot handle varying image sizes by default, I am using a custom collate function to put all image tensors in a list instead of one big tensor:
    def collate_data(batch):
        data = [item[0] for item in batch]
        target = [item[1] for item in batch]
        target = torch.LongTensor(target)
        return [data, target]
  • I cannot feed my model a list of tensors, so in my main loop I iterate over the list and feed the images one by one, concatenating the outputs of the net into a single FloatTensor:
for epoch in range(start_epoch, num_epochs):
    for i, (images, labels) in enumerate(train_loader, 0):
        outputs = torch.Tensor()
        for im in images:
            im = im.unsqueeze(0)
            im = im.to(device)

            # Forward pass
            out = model(im)
            outputs = torch.cat((outputs, out))

        outputs = outputs.reshape(labels.shape[0], -1)
        loss = criterion(outputs, labels)

        # Backward and optimize

I also tried calculating the loss in the inner for-loop, then summing and averaging it. I also tried calling loss.backward() in the inner for-loop and only doing the optimizer step outside the loop. All methods produce the same results.
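For reference, the second variant (backward per sample, one optimizer step per batch) could be sketched as follows. This is only a minimal sketch: `accumulate_step` is a hypothetical helper name, the toy model and random data are assumptions made to keep the example self-contained, and the loss is divided by the number of samples so that the accumulated gradients correspond to the mean loss.

```python
import torch
import torch.nn as nn

def accumulate_step(model, criterion, optimizer, images, labels, device="cpu"):
    """Run one optimizer step over a list of differently sized images."""
    model.train()
    optimizer.zero_grad()
    n = len(images)
    total_loss = 0.0
    for im, label in zip(images, labels):
        # (C, H, W) -> (1, C, H, W), then flatten the output to (1, num_classes)
        out = model(im.unsqueeze(0).to(device)).reshape(1, -1)
        # Divide by n so the summed gradients match the mean loss over the batch
        loss = criterion(out, label.reshape(1).to(device)) / n
        loss.backward()  # gradients accumulate in each parameter's .grad
        total_loss += loss.item()
    optimizer.step()
    return total_loss

# Toy usage (hypothetical model and data, only to show the call)
torch.manual_seed(0)
toy_model = nn.Sequential(nn.Conv2d(3, 2, kernel_size=3), nn.AdaptiveAvgPool2d(1))
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)
toy_images = [torch.randn(3, 20, 20), torch.randn(3, 32, 28)]
toy_labels = torch.LongTensor([0, 1])
before = [p.detach().clone() for p in toy_model.parameters()]
step_loss = accumulate_step(toy_model, nn.CrossEntropyLoss(), toy_optimizer,
                            toy_images, toy_labels)
```

Because the gradients are summed across the backward calls, this should be equivalent (up to floating point error) to computing the mean loss over the concatenated outputs, as long as the model has no layers whose behaviour depends on the batch size.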

When I take the very same network architecture but drop the custom collate function and use the following main loop, everything runs as expected (assuming the images have the same size):

    for i, (images, labels) in enumerate(train_loader, 0):
        images = images.to(device)
        labels = labels.to(device)

        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Backward and optimize

outputs and labels have the same shapes in both cases before feeding the criterion function.

At the moment I only see one possible cause for the network not learning:
Autograd is not working as I expect in the inner for-loop. Maybe creating the new ‘outputs’ tensor breaks the graph, so that no usable gradient is backpropagated.
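One way to test this hypothesis in isolation is to check whether gradients reach the parameters after concatenating per-sample outputs. The snippet below is a minimal sketch with a hypothetical two-layer model and random inputs, not the network from this post.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
# Hypothetical stand-in for the real network: conv + global average pooling
check_model = nn.Sequential(nn.Conv2d(3, 2, kernel_size=3), nn.AdaptiveAvgPool2d(1))
check_images = [torch.randn(3, 20, 20), torch.randn(3, 32, 28)]
check_labels = torch.tensor([0, 1])

# Feed the samples one by one and concatenate the outputs
check_outputs = torch.cat([check_model(im.unsqueeze(0)) for im in check_images])
check_outputs = check_outputs.reshape(len(check_images), -1)

check_loss = F.cross_entropy(check_outputs, check_labels)
check_loss.backward()

# Every parameter should have received a gradient through torch.cat
grads_present = all(p.grad is not None for p in check_model.parameters())
weight_grad_norm = check_model[0].weight.grad.abs().sum().item()
```

If `grads_present` is true and `weight_grad_norm` is nonzero, the concatenation itself is not cutting the graph, and the cause is more likely elsewhere in the model or training setup.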

Do you have any suggestions regarding this issue? Maybe I just have a big error in reasoning.

Thank you very much in advance.

How large are the differences between the image resolutions?
Are you using any nn.BatchNorm layers in your model?
If so, could you remove them or replace them with nn.InstanceNorm?
Since you are feeding the samples one by one, the running estimates might be quite off.
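A minimal sketch of such a replacement, assuming the model uses nn.BatchNorm2d layers: `replace_batchnorm` is a hypothetical helper (not a PyTorch function) that walks the module tree and swaps each layer in place. The small `demo_model` is also just an assumption for illustration.

```python
import torch.nn as nn

def replace_batchnorm(module):
    """Recursively swap nn.BatchNorm2d for nn.InstanceNorm2d (in place)."""
    for name, child in list(module.named_children()):
        if isinstance(child, nn.BatchNorm2d):
            # affine=True keeps learnable scale and shift, as BatchNorm has by default
            setattr(module, name, nn.InstanceNorm2d(child.num_features, affine=True))
        else:
            replace_batchnorm(child)
    return module

# Hypothetical model, only to show the call
demo_model = nn.Sequential(
    nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU(),
    nn.Sequential(nn.Conv2d(8, 8, 3), nn.BatchNorm2d(8)),
)
replace_batchnorm(demo_model)
num_bn = sum(isinstance(m, nn.BatchNorm2d) for m in demo_model.modules())
num_in = sum(isinstance(m, nn.InstanceNorm2d) for m in demo_model.modules())
```

Note that the new layers start with freshly initialized parameters, so the model would need to be trained again after the swap.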

The images are between 200 x 200 px and 500 x 500 px in size. To test the changes made to the model for handling different image sizes, I am still resizing all the images to 304 x 304 pixels. So I can rule out the size differences as the current source of the error.

I am using a BatchNorm after every conv layer which is not used to reduce the image dimensions (I have conv layers with a stride > 1 instead of pooling in between conv layers)

Conv -> BatchNorm -> Relu -> Conv (Stride > 1) -> Relu
-> Conv -> BatchNorm -> Relu -> Conv (Stride > 1) -> Relu

I will acquaint myself with nn.InstanceNorm and try it out.

Is there a better way than feeding every sample one by one when the samples differ in size? The only option that comes to my mind is zero-padding, but I am not sure how this affects the training performance.
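For illustration, the zero-padding idea could look roughly like the collate function below. This is only a sketch: `pad_collate` is a hypothetical name, it assumes each dataset item is an `(image, label)` pair with the image shaped (C, H, W), and it pads on the right and bottom up to the largest height and width in the batch.

```python
import torch
import torch.nn.functional as F

def pad_collate(batch):
    """Zero-pad all images in a batch to the largest height and width."""
    images = [item[0] for item in batch]
    targets = torch.tensor([item[1] for item in batch], dtype=torch.long)
    max_h = max(im.shape[-2] for im in images)
    max_w = max(im.shape[-1] for im in images)
    padded = [
        # F.pad order for the last two dims: (left, right, top, bottom)
        F.pad(im, (0, max_w - im.shape[-1], 0, max_h - im.shape[-2]))
        for im in images
    ]
    return torch.stack(padded), targets

# Toy usage with random "images" of different sizes
toy_batch = [(torch.randn(3, 20, 30), 0), (torch.randn(3, 25, 18), 1)]
padded_images, padded_targets = pad_collate(toy_batch)
```

The padded zeros would also be included in any global average pooling at the end of the network, so the effect on training would have to be checked experimentally.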

Thank you very much in advance!

Another approach would be to use an adaptive pooling layer, which outputs an activation volume of the specified output_size. This would allow you to feed differently sized inputs (with the same size inside each batch) through the model and create a fixed-size activation before the linear layer.
However, you would need to group same-sized images into batches.
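A minimal sketch of this grouping, assuming the (height, width) of every image is known up front: `batches_by_size` is a hypothetical helper that buckets dataset indices by size and cuts each bucket into batches.

```python
import random
from collections import defaultdict

def batches_by_size(sizes, batch_size, shuffle=True, seed=None):
    """Return lists of dataset indices where each list holds images of one size."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for idx, size in enumerate(sizes):
        buckets[tuple(size)].append(idx)

    batches = []
    for indices in buckets.values():
        if shuffle:
            rng.shuffle(indices)
        # Cut each bucket into chunks of at most batch_size indices
        for start in range(0, len(indices), batch_size):
            batches.append(indices[start:start + batch_size])
    if shuffle:
        rng.shuffle(batches)
    return batches

# Toy usage: (height, width) per image, in dataset order
example_sizes = [(200, 200), (500, 500), (200, 200), (304, 304), (200, 200)]
example_batches = batches_by_size(example_sizes, batch_size=2, seed=0)
```

The resulting list of index lists can be passed to a DataLoader via its `batch_sampler` argument, so the default collate function can stack each batch into one tensor again.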

I use adaptive_avg_pool2d to deal with the size differences. And instead of a classical linear layer I use a convolution layer with a kernel size of 1, which acts like a dense layer (as proposed in the paper Network in Network).

The full structure is like this:

Conv -> BatchNorm -> Relu -> Conv (Stride > 1) -> Relu -> … Conv (64, 64, Kernel = 1) -> BatchNorm -> Relu -> Conv (Stride > 1) -> Relu -> Conv (64, NUM_CLASSES) -> BatchNorm -> Relu -> adaptive_avg_pool2d -> Softmax

This structure is derived from the paper “Striving for simplicity: The all convolutional net” and learns quite well.
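A heavily shortened sketch of this kind of structure is shown below. The channel counts, the class name `TinyAllConvNet`, and the choice to return raw scores (leaving out the final BatchNorm, ReLU and Softmax) are assumptions made to keep the example small; it is not the actual model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyAllConvNet(nn.Module):
    """Shortened all-convolutional classifier that accepts any input size."""

    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),
            nn.ReLU(inplace=True),
            # Strided convolution instead of pooling to reduce the resolution
            nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        # 1x1 convolution acting like a dense layer at every spatial position
        self.classifier = nn.Conv2d(16, num_classes, kernel_size=1)

    def forward(self, x):
        x = self.classifier(self.features(x))
        # Global average pooling gives one value per class for any input size
        return F.adaptive_avg_pool2d(x, 1).flatten(1)

torch.manual_seed(0)
net = TinyAllConvNet(num_classes=2)
out_small = net(torch.randn(1, 3, 64, 64))
out_large = net(torch.randn(1, 3, 96, 80))
```

Both outputs have shape (1, 2), although the inputs differ in size.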

But the suggestion to group same-sized images is a good idea that I might pursue. Still, I would like to understand why feeding the samples one by one and collecting the outputs does not work. Could you maybe explain “Since you are feeding the samples one by one, the running estimates might be quite off.” a little further?

OK, I see.
Sure, I'll try to explain it a bit further.
nn.BatchNorm layers track the mean and variance of your data by calculating these values from each batch and updating their running_mean and running_var buffers, respectively.
The running estimates are updated using the momentum argument as explained in the docs.
The mean and std of a single sample might differ a lot compared to the overall data, so that the running estimates are working like a “smoothing filter” following the sample’s mean and std instead of tracking the dataset stats.
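As a small numeric illustration of this update (a sketch with made-up constant inputs, not real image data), two single-sample batches with very different means pull the running mean back and forth:

```python
import torch
import torch.nn as nn

# running_mean starts at 0 and running_var at 1
bn = nn.BatchNorm2d(1, momentum=0.1)
bn.train()

with torch.no_grad():
    # A single "bright" sample: every pixel is 10
    bn(torch.full((1, 1, 4, 4), 10.0))
    mean_after_first = bn.running_mean.item()   # 0.9 * 0 + 0.1 * 10 = 1.0

    # A single "dark" sample: every pixel is -10
    bn(torch.full((1, 1, 4, 4), -10.0))
    mean_after_second = bn.running_mean.item()  # 0.9 * 1.0 + 0.1 * (-10) = -0.1
```

With larger batches the per-batch mean would already be an average over many samples, so the running estimate would fluctuate much less.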
You could try to play around with the momentum to give a little less weight to the sample stats and more to the running stats, but in general it has been shown that BatchNorm performs poorly with small batch sizes. The GroupNorm paper introduced a new normalization layer, which should perform better with small batch sizes.
However, if you use single instances, I would assume that InstanceNorm might be a good choice.
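To see the difference for a single sample, one could compare how much each layer's output changes between train() and eval() mode. This is a rough sketch on random data with default layer settings; `train_eval_gap` is a hypothetical helper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# One sample whose statistics are far from the BatchNorm defaults (mean 0, var 1)
sample = torch.randn(1, 8, 16, 16) * 3 + 5

def train_eval_gap(layer):
    """Largest absolute difference between train-mode and eval-mode outputs."""
    with torch.no_grad():
        layer.train()
        y_train = layer(sample)
        layer.eval()
        y_eval = layer(sample)
    return (y_train - y_eval).abs().max().item()

gap_bn = train_eval_gap(nn.BatchNorm2d(8))
gap_in = train_eval_gap(nn.InstanceNorm2d(8, affine=True))
gap_gn = train_eval_gap(nn.GroupNorm(num_groups=4, num_channels=8))
```

By default, InstanceNorm and GroupNorm normalize with the statistics of the current sample in both modes, so their gap is zero, while BatchNorm switches to its running estimates in eval() mode and therefore produces different outputs.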

Thank you for this clear and precise explanation. I will try InstanceNorm and report the results. If that is not successful, I will try to group the batches by image size.