Resizing images to feed into a neural network

I have a semantic segmentation task at hand. My input images have shape (2056, 2464, 3). The network I am using is “fcn_resnet101”. The input for this model should be 224*224, so I resize my images:

from PIL import Image
from torchvision import transforms

data_transforms = {
    'train': transforms.Compose([
        transforms.Resize((input_size, input_size), Image.NEAREST),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'val': transforms.Compose([
        transforms.Resize((input_size, input_size), Image.NEAREST),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
}

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset

class Train_Dataset(Dataset):

    def __init__(self, imgarray, labelarray, transform=None):
        self.images = imgarray
        self.labels = labelarray
        self.transform = transform

    def __getitem__(self, index):
        img = self.images[index]
        if self.transform is not None:
            img = self.transform(img)
        Img = Image.fromarray(self.labels[index])  # numpy array to PIL
        PilImg = Img.resize((224, 224), Image.NEAREST)  # resize the PIL image
        # convert PIL to numpy, then convert the numpy array to a tensor
        label = torch.from_numpy(np.asarray(PilImg))
        # print(img.size(), label.size())
        return img, label

    def __len__(self):
        return len(self.images)

I am getting strange results when training: training accuracy surpasses 100% in the second epoch and validation accuracy stalls at 21 percent.

I want to solve this. I assumed that resizing the images may be the reason I’m getting this result. Should I split the input images into smaller ones (224*224) and then sum the results at the end, or is just resizing the correct approach? I read somewhere that we split a big input image when we can’t read it into memory, but with PyTorch we have the DataLoader, so does that mean splitting is not necessary? Or do we need it anyway, since resizing means losing valuable information?


As you have pointed out, resizing such large images to 224*224 would lose many useful visual features because of the drop in resolution. You need these visual cues to train your network.

You need to divide your images into patches of size 224*224 and train your model with batches of multiple patches, maybe 4 to 8 patches at each step. And no, the DataLoader only handles the collation, I/O, and parallel (multi-worker) loading of the images. It won’t split the images into patches automatically, as far as I know.
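
To make the patch idea concrete, here is a minimal sketch using only NumPy. The helper names `tile_offsets` and `split_into_patches` are made up for illustration, and shifting the last tile back so it overlaps its neighbour is just one possible way to handle the 2056-pixel side, which is not a multiple of 224 (padding or dropping the remainder would also work).

```python
import numpy as np

def tile_offsets(length, tile=224):
    """Start offsets of tiles covering [0, length); the last tile is shifted back to fit."""
    if length < tile:
        raise ValueError("image side is smaller than the tile size")
    offsets = list(range(0, length - tile + 1, tile))
    if offsets[-1] + tile < length:
        offsets.append(length - tile)  # this tile overlaps the previous one
    return offsets

def split_into_patches(image, tile=224):
    """Return an array of shape (num_patches, tile, tile, ...) cut from one image or label map."""
    rows = tile_offsets(image.shape[0], tile)
    cols = tile_offsets(image.shape[1], tile)
    return np.stack([image[r:r + tile, c:c + tile] for r in rows for c in cols])

# Placeholder image with the size from the question.
image = np.zeros((2056, 2464, 3), dtype=np.uint8)
patches = split_into_patches(image)
print(patches.shape)  # (110, 224, 224, 3): 10 rows x 11 columns of tiles
```

The same function can be applied to the 2D label array, so that image patches and label patches stay in the same order.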

Note that you still have an alternative, though. This depends on the scale of your GPU resources, but you may be able to fit the images and your model into GPU memory at the same time. Just as an experiment, you could try resizing your images to half the size in each dimension to see what happens. If you still get a memory error, try 1/4. If you have enough resources, you might be able to fit your images without compromising much of the good information in your training images.
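
As a rough back-of-the-envelope check, the snippet below estimates how much memory a single float32 input tensor takes at full, half, and quarter resolution. The function name `input_megabytes` is hypothetical, and this counts only the input itself: the network's activations and gradients usually take far more memory, so treat it as a lower bound rather than a prediction of whether training fits.

```python
def input_megabytes(height, width, channels=3, bytes_per_value=4):
    """Size of one float32 image tensor in megabytes (input only, no activations)."""
    return height * width * channels * bytes_per_value / 1e6

for scale in (1, 2, 4):
    h, w = 2056 // scale, 2464 // scale
    print(f"1/{scale} size: {h}x{w}, about {input_megabytes(h, w):.1f} MB per image")
```

Halving each dimension cuts the pixel count, and so this number, by a factor of four.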

Thank you very much.
In case I have the resources (I haven’t gotten any memory errors), how should I prepare the images for the model? My model only accepts images of size 224*224.

You can systematically take 224*224 crops from your images and the corresponding ground truths and use them as your new training images.
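
One way to do this, sketched below under some assumptions: all images share the same size, labels are 2D arrays aligned with the images, and any remainder that does not fill a full 224*224 tile is simply dropped. The class name `PairedCropDataset` is invented for this example. It only implements `__len__` and `__getitem__`; in a real PyTorch pipeline you would presumably subclass `torch.utils.data.Dataset` and convert the crops to tensors (and normalize the image) before returning them.

```python
import numpy as np

class PairedCropDataset:
    """Each index is one tile cut from an image and from its label at the same window."""

    def __init__(self, images, labels, tile=224):
        self.images, self.labels, self.tile = images, labels, tile
        height, width = images[0].shape[:2]
        self.rows = height // tile  # full tiles per column; the remainder is dropped
        self.cols = width // tile   # full tiles per row

    def __len__(self):
        return len(self.images) * self.rows * self.cols

    def __getitem__(self, index):
        # Map the flat index to (which image, which grid cell).
        img_idx, cell = divmod(index, self.rows * self.cols)
        r, c = divmod(cell, self.cols)
        top, left = r * self.tile, c * self.tile
        window = (slice(top, top + self.tile), slice(left, left + self.tile))
        # The same window is applied to image and label so they stay aligned.
        return self.images[img_idx][window], self.labels[img_idx][window]

# Placeholder data with the image size from the question.
images = np.zeros((2, 2056, 2464, 3), dtype=np.uint8)
labels = np.zeros((2, 2056, 2464), dtype=np.uint8)
dataset = PairedCropDataset(images, labels)
img, lab = dataset[100]
print(len(dataset), img.shape, lab.shape)  # 198 (224, 224, 3) (224, 224)
```

Because the crop is taken from the label with the same window, no interpolation touches the class IDs, unlike when the label map is resized.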

Okay, thank you. I will try that :)