I have a dataset consisting of large images. I only need to look at a specific area in each image, so I have converted the entire dataset into an HDF5 file and I plan to lazily load that specific area per image.
However, since these areas differ between images, I end up with inputs of different sizes. The suggested way to handle that is to write a custom collate function that pads all the inputs to the same size. The catch is, instead of increasing the size by padding (which adds useless information to my input), I would rather enlarge the selected area and simply load a larger region from my HDF5 file.
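For reference, the padding-based collate I'm trying to avoid looks roughly like this (the `(C, H, W)` layout and zero-padding to the batch maximum are just for illustration):

```python
import torch
from torch.nn.functional import pad

def pad_collate(batch):
    # batch: list of (C, H, W) tensors with varying H and W
    max_h = max(x.shape[1] for x in batch)
    max_w = max(x.shape[2] for x in batch)
    # pad right/bottom with zeros up to the batch maximum
    padded = [pad(x, (0, max_w - x.shape[2], 0, max_h - x.shape[1]))
              for x in batch]
    return torch.stack(padded)
```

The padded pixels carry no information about the image, which is exactly what I want to avoid.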
To do this, I need to calculate the batch's input size before I load anything, and for that I need to know which images will end up in each batch. However, I don't have access to that information inside the Dataset.
One solution I can think of is to load the data inside the collate function instead of in the Dataset, but I am worried that this will interfere with the parallelization done by the DataLoader and slow down training.
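To make that concrete, here is a rough sketch of what I mean: the Dataset would return only the index and crop box, and the collate function would enlarge every box to the batch maximum before reading any pixels. The box format and the in-memory stand-in for the HDF5 read are hypothetical (in the real code `read_region` would slice into the `h5py` file), and bounds clamping at image edges is omitted for brevity:

```python
import numpy as np
import torch

# Stand-in for the HDF5 file; in the real code read_region would do
# something like h5file["images"][idx][y0:y0 + h, x0:x0 + w] with h5py.
IMAGES = [np.zeros((64, 64), dtype=np.float32) for _ in range(4)]

def read_region(idx, y0, x0, h, w):
    return IMAGES[idx][y0:y0 + h, x0:x0 + w]

def region_collate(batch):
    # batch: list of (idx, (y0, x0, h, w)) pairs produced by the Dataset;
    # grow every region to the batch maximum before loading any pixels
    max_h = max(h for _, (_, _, h, _) in batch)
    max_w = max(w for _, (_, _, _, w) in batch)
    crops = [torch.from_numpy(read_region(i, y0, x0, max_h, max_w))
             for i, (y0, x0, _, _) in batch]
    return torch.stack(crops)
```

This way every extra pixel loaded is real image content rather than padding, but the actual I/O now happens in the collate step.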
Is there a better way to solve this problem? Or is there a way to access all the indices of a batch inside the Dataset?
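While writing this up, I found a pattern in the DataLoader documentation (the section on disabling automatic batching) that may be relevant: passing a `BatchSampler` as the plain `sampler` together with `batch_size=None` makes `__getitem__` receive the whole list of indices for a batch. A minimal sketch, with hypothetical per-image crop sizes and zeros standing in for the real HDF5 reads:

```python
import torch
from torch.utils.data import BatchSampler, DataLoader, Dataset, SequentialSampler

class BatchAwareDataset(Dataset):
    """__getitem__ receives the full list of indices for one batch,
    so the crop size can be decided before any pixels are loaded."""
    def __init__(self, sizes):
        self.sizes = sizes  # hypothetical per-image crop sizes (h, w)

    def __len__(self):
        return len(self.sizes)

    def __getitem__(self, indices):
        # `indices` is a list because auto-batching is disabled below
        h = max(self.sizes[i][0] for i in indices)
        w = max(self.sizes[i][1] for i in indices)
        # here I would read (h, w) regions from the HDF5 file;
        # zeros stand in for the real reads
        return torch.zeros(len(indices), h, w)

sizes = [(4, 5), (6, 2), (3, 3), (5, 5)]
ds = BatchAwareDataset(sizes)
loader = DataLoader(
    ds,
    sampler=BatchSampler(SequentialSampler(ds), batch_size=2, drop_last=False),
    batch_size=None,  # each element from the sampler (a list) goes to __getitem__
)
```

Is this the intended way to do it, or does it have drawbacks I'm not seeing (e.g. for shuffling or multi-worker loading)?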