I have a dataset consisting of large images. I only need to look at a specific area in each image, so I have converted the entire dataset into an HDF5 file and I plan to lazily load that specific area per image.
However, since these areas differ between images, I end up with inputs of different sizes. The suggested way to handle that is to write a custom collate function that pads all the inputs to the same size. The catch is, instead of increasing the size by padding (which adds useless information to my input), I would rather enlarge the selected area and simply load a larger region from my HDF5 file.
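For reference, the padding-based collate I'm trying to avoid looks roughly like this (the `(C, H, W)` layout and zero-padding to the batch maximum are just for illustration):

```python
import torch
from torch.nn.functional import pad

def pad_collate(batch):
    # batch: list of (C, H, W) tensors with varying H and W
    max_h = max(x.shape[1] for x in batch)
    max_w = max(x.shape[2] for x in batch)
    # pad right/bottom with zeros up to the batch maximum
    padded = [pad(x, (0, max_w - x.shape[2], 0, max_h - x.shape[1]))
              for x in batch]
    return torch.stack(padded)
```

The padded pixels carry no information about the image, which is exactly what I want to avoid.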
To do this, I need to calculate the batch's input size before I load anything, and for that I need to know which images will end up in each batch. However, I don't have access to that information inside the Dataset.
One solution I can think of is to load the data inside the collate function instead of in the Dataset, but I am worried that this will interfere with the parallelization done by the DataLoader and slow down training.
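To make that concrete, here is a rough sketch of what I mean: the Dataset would return only the index and crop box, and the collate function would enlarge every box to the batch maximum before reading any pixels. The box format and the in-memory stand-in for the HDF5 read are hypothetical (in the real code `read_region` would slice into the `h5py` file), and bounds clamping at image edges is omitted for brevity:

```python
import numpy as np
import torch

# Stand-in for the HDF5 file; in the real code read_region would do
# something like h5file["images"][idx][y0:y0 + h, x0:x0 + w] with h5py.
IMAGES = [np.zeros((64, 64), dtype=np.float32) for _ in range(4)]

def read_region(idx, y0, x0, h, w):
    return IMAGES[idx][y0:y0 + h, x0:x0 + w]

def region_collate(batch):
    # batch: list of (idx, (y0, x0, h, w)) pairs produced by the Dataset;
    # grow every region to the batch maximum before loading any pixels
    max_h = max(h for _, (_, _, h, _) in batch)
    max_w = max(w for _, (_, _, _, w) in batch)
    crops = [torch.from_numpy(read_region(i, y0, x0, max_h, max_w))
             for i, (y0, x0, _, _) in batch]
    return torch.stack(crops)
```

This way every extra pixel loaded is real image content rather than padding, but the actual I/O now happens in the collate step.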
Is there a better way to solve this problem? Or is there a way to access all the indices of a batch inside the Dataset?
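While writing this up, I found a pattern in the DataLoader documentation (the section on disabling automatic batching) that may be relevant: passing a `BatchSampler` as the plain `sampler` together with `batch_size=None` makes `__getitem__` receive the whole list of indices for a batch. A minimal sketch, with hypothetical per-image crop sizes and zeros standing in for the real HDF5 reads:

```python
import torch
from torch.utils.data import BatchSampler, DataLoader, Dataset, SequentialSampler

class BatchAwareDataset(Dataset):
    """__getitem__ receives the full list of indices for one batch,
    so the crop size can be decided before any pixels are loaded."""
    def __init__(self, sizes):
        self.sizes = sizes  # hypothetical per-image crop sizes (h, w)

    def __len__(self):
        return len(self.sizes)

    def __getitem__(self, indices):
        # `indices` is a list because auto-batching is disabled below
        h = max(self.sizes[i][0] for i in indices)
        w = max(self.sizes[i][1] for i in indices)
        # here I would read (h, w) regions from the HDF5 file;
        # zeros stand in for the real reads
        return torch.zeros(len(indices), h, w)

sizes = [(4, 5), (6, 2), (3, 3), (5, 5)]
ds = BatchAwareDataset(sizes)
loader = DataLoader(
    ds,
    sampler=BatchSampler(SequentialSampler(ds), batch_size=2, drop_last=False),
    batch_size=None,  # each element from the sampler (a list) goes to __getitem__
)
```

Is this the intended way to do it, or does it have drawbacks I'm not seeing (e.g. for shuffling or multi-worker loading)?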