Batching images in each folder together

Hi!

I have a bunch of training images which I have organized into folders grouped by image size (e.g., 200x600 images go in a 200x600 folder).

When initializing my DataLoader and passing the batch size, I'd like images that share the same folder to be processed together in one batch. Currently, images of different sizes are being batched together, which leads to an error.

I do not know how to achieve this and would appreciate any feedback. If you have any questions about my issue, please let me know. Below is the code for my custom Dataset class, which may be relevant.

import os
from glob import glob

import numpy as np
import numpy.typing as npt
from PIL import Image
from torch.utils.data import Dataset


class ImageDataset(Dataset):
    def __init__(self, directory: str):
        # Gather every PNG one level below the root, e.g. <directory>/<size_folder>/image.png
        self.files: list[str] = glob(os.path.join(directory, "*/*.png"))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, item_id: int) -> tuple[npt.NDArray, npt.NDArray]:
        with open(self.files[item_id], "rb") as fh:
            with Image.open(fh) as image:
                target_data = np.array(image)
                # Downscale by 4x for the low-resolution input; PIL's resize takes (width, height)
                input_data = np.array(
                    image.resize((target_data.shape[1] // 4, target_data.shape[0] // 4), Image.BILINEAR)
                )
                # Reorder axes from (H, W, C) to (C, W, H)
                input_data = input_data.transpose(2, 1, 0)
                target_data = target_data.transpose(2, 1, 0)

        return input_data, target_data

One option is to write a custom BatchSampler that ensures all indices within a batch belong to the same folder. Then you can pass it to the DataLoader through its batch_sampler argument.
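
Here is a rough sketch of what that could look like, assuming the dataset exposes its file list as self.files (as in your class above); the name FolderBatchSampler and the shuffle flag are just illustrative choices:

import os
import random
from collections import defaultdict

from torch.utils.data import DataLoader, Sampler


class FolderBatchSampler(Sampler):
    """Yields batches of dataset indices whose files all come from the same folder."""

    def __init__(self, files: list[str], batch_size: int, shuffle: bool = True):
        self.batch_size = batch_size
        self.shuffle = shuffle
        # Group dataset indices by the parent folder of each file path
        self.groups = defaultdict(list)
        for idx, path in enumerate(files):
            self.groups[os.path.dirname(path)].append(idx)

    def __iter__(self):
        batches = []
        for indices in self.groups.values():
            if self.shuffle:
                random.shuffle(indices)
            # Split each folder's indices into chunks of at most batch_size
            for start in range(0, len(indices), self.batch_size):
                batches.append(indices[start:start + self.batch_size])
        if self.shuffle:
            random.shuffle(batches)
        return iter(batches)

    def __len__(self):
        # Number of batches across all folders
        return sum(
            (len(indices) + self.batch_size - 1) // self.batch_size
            for indices in self.groups.values()
        )

Usage would look something like this (path is a placeholder):

dataset = ImageDataset("path/to/training_images")
loader = DataLoader(dataset, batch_sampler=FolderBatchSampler(dataset.files, batch_size=16))

Note that when batch_sampler is given, you must not also pass batch_size, shuffle, sampler, or drop_last to the DataLoader, since they are mutually exclusive with it. Because every batch then contains images of a single size, the default collate function can stack them without errors.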