How to have dataloader access a larger image in memory to take small crops from instead of reloading that big image everytime?

Hi. In the __getitem__ function of my dataset, I currently have the generator look up an image path and crop coordinates from a list, load that image, and crop out the region at those coordinates. Many of the same images get loaded again on each retrieval just to take crops at different coordinates. Is there a way to store the big image in memory so the generator can just look there if it’s available, rather than loading it from disk every time?

    batch_df = self.final_cell_df.loc[self.df_index[index * self.batch_size:(index + 1) * self.batch_size]]

    output = []
    for _, row in batch_df.iterrows():
        # the same big image may be re-read from disk for every crop taken from it
        marker_slice = io.imread(row['%s_Path' % self.marker])  # 300, 300

        marker_crop = marker_slice[row.y - int(self.crop_size[0] / 2):row.y + int(self.crop_size[0] / 2),
                                   row.x - int(self.crop_size[1] / 2):row.x + int(self.crop_size[1] / 2)]

        label = np.array(row.Label)
        output.append((marker_crop, label))

Thank you!

Hi Vivek!

There could be a lot of ways of doing this.

You don’t say how big your “big images” are, how many you have,
whether your coordinates are known in advance or generated on the
fly, or whether you’re running your processing on the gpu or cpu.

Based on the details, you might consider the following approaches:

Plan A: If you know your cropping coordinates in advance, you could
simply preprocess your big images and store all of the cropped sub-images
on disk. (Disk space for this is likely to be your cheapest resource.)

Then just implement the standard Dataset / DataLoader approach
and have the Dataset read the pre-cropped images off disk as needed.
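
For example, here is a minimal sketch of Plan A. The column names and crop
logic are taken from your snippet; crop_dir, the file-naming scheme, and the
usage values are made-up placeholders:

    import os

    import numpy as np
    import torch
    from skimage import io
    from torch.utils.data import Dataset

    def precrop_to_disk(final_cell_df, marker, crop_size, crop_dir):
        """One-time preprocessing: write every crop to disk as a .npy file."""
        os.makedirs(crop_dir, exist_ok=True)
        half_h, half_w = crop_size[0] // 2, crop_size[1] // 2
        for i, row in final_cell_df.iterrows():
            big = io.imread(row['%s_Path' % marker])
            crop = big[row.y - half_h:row.y + half_h,
                       row.x - half_w:row.x + half_w]
            np.save(os.path.join(crop_dir, 'crop_%s.npy' % i), crop)

    class CropDataset(Dataset):
        """Standard map-style Dataset that just reads the pre-cropped files."""
        def __init__(self, final_cell_df, crop_dir):
            self.labels = final_cell_df['Label'].to_numpy()
            self.paths = [os.path.join(crop_dir, 'crop_%s.npy' % i)
                          for i in final_cell_df.index]

        def __len__(self):
            return len(self.paths)

        def __getitem__(self, idx):
            return torch.from_numpy(np.load(self.paths[idx])), self.labels[idx]

    # example usage (final_cell_df, marker, and crop_size are your existing objects):
    #   precrop_to_disk(final_cell_df, marker='YourMarker', crop_size=(64, 64), crop_dir='crops/')
    #   loader = DataLoader(CropDataset(final_cell_df, 'crops/'), batch_size=32, shuffle=True)

Reading many small files is still disk I/O, but each crop is read only once
per epoch and you keep full shuffle = True flexibility.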

Plan B: If you know that your DataLoader (or equivalent logic) will be
accessing the sub-images from a given big image all in a row (that is,
something like sub-images 1, 2, 3 from big image A followed by sub-images
1, 2, 3, 4, 5 from big image B, and so on), you can simply cache the big
images.

That is, your DataLoader (or equivalent logic) requests, for example,
big image B, sub-image 3. Your code checks whether big image B has
already been read in and is still cached in memory. If not, it reads
image B into memory (evicting other big images from the cache if
necessary for space reasons). Now image B is in the cache if it wasn’t
already, so your Dataset (or equivalent logic) crops out sub-image 3
and returns it. As long as your DataLoader next requests sub-images
4 and 5 from big image B, big image B will still be in the cache and will
not need to be read in again. Of course, this approach won’t let you use
shuffle = True with your DataLoader – something that you would
typically want to do.
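
As a rough sketch of the caching idea (the class and argument names here are
placeholders, not your actual code), the Dataset could keep a small
least-recently-used cache of big images:

    from collections import OrderedDict

    import numpy as np
    import torch
    from skimage import io
    from torch.utils.data import Dataset

    class CachedBigImageDataset(Dataset):
        """Map-style Dataset that keeps the most recently used big images in memory."""

        def __init__(self, final_cell_df, marker, crop_size, cache_size=4):
            self.df = final_cell_df.reset_index(drop=True)
            self.path_col = '%s_Path' % marker
            self.half_h, self.half_w = crop_size[0] // 2, crop_size[1] // 2
            self.cache_size = cache_size
            self._cache = OrderedDict()          # path -> big image array

        def _get_big_image(self, path):
            if path in self._cache:              # cache hit: no disk read
                self._cache.move_to_end(path)
            else:                                # cache miss: expensive disk read
                self._cache[path] = io.imread(path)
                if len(self._cache) > self.cache_size:
                    self._cache.popitem(last=False)   # evict least recently used
            return self._cache[path]

        def __len__(self):
            return len(self.df)

        def __getitem__(self, idx):
            row = self.df.iloc[idx]
            big = self._get_big_image(row[self.path_col])
            crop = big[row.y - self.half_h:row.y + self.half_h,
                       row.x - self.half_w:row.x + self.half_w]
            return torch.from_numpy(np.ascontiguousarray(crop)), row.Label

Bear in mind that with num_workers > 0 each DataLoader worker process keeps
its own copy of the cache, and, as noted above, the cache only pays off when
consecutive requests tend to hit the same big image.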

Plan C: Consider pre-loading (or lazy-loading) your big images into
regular cpu ram if you have enough space to fit them all. Depending
on your hardware, you may well have a lot more cpu memory than gpu
memory. You still have to crop out your sub-images and move the crops to
the gpu (or move the big image to the gpu and crop there), which will be
slower than if the big image were
already in the gpu, but moving data from the cpu to the gpu is still a lot
cheaper than reading it off the disk.
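
A minimal sketch of Plan C, assuming all of your distinct big images fit in
cpu memory (again, the names are placeholders):

    import numpy as np
    import torch
    from skimage import io
    from torch.utils.data import Dataset

    class PreloadedBigImageDataset(Dataset):
        """Reads every distinct big image into cpu ram once, up front."""

        def __init__(self, final_cell_df, marker, crop_size):
            self.df = final_cell_df.reset_index(drop=True)
            self.path_col = '%s_Path' % marker
            self.half_h, self.half_w = crop_size[0] // 2, crop_size[1] // 2
            # one disk read per distinct big image, kept in a dict keyed by path
            self.big_images = {path: io.imread(path)
                               for path in self.df[self.path_col].unique()}

        def __len__(self):
            return len(self.df)

        def __getitem__(self, idx):
            row = self.df.iloc[idx]
            big = self.big_images[row[self.path_col]]     # no disk access here
            crop = big[row.y - self.half_h:row.y + self.half_h,
                       row.x - self.half_w:row.x + self.half_w]
            return torch.from_numpy(np.ascontiguousarray(crop)), row.Label

Because everything stays resident in cpu memory, this works fine with
shuffle = True; you then move each batch of crops to the gpu in your training
loop with something like batch = batch.to('cuda').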

Plan D: Write a custom Sampler. This approach probably gives you the
most control consistent with reusing the same big image multiple times
before reading in a new one.

PyTorch’s data architecture imagines the following division of labor: The
Dataset knows how to retrieve images (usually from disk) by index or
map key, possibly transforming them along the way. The DataLoader
knows how to iterate over a Dataset, possibly collating the samples
into batches. The Sampler is a kind of glue between the two that lets
you use more elaborate logic when sampling from the Dataset. (For
example, WeightedRandomSampler lets you pick different samples with
different probabilities such as when you want to compensate for a data
imbalance.)
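
For example, just to show where a Sampler plugs in (with a toy dataset):

    import torch
    from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

    # toy dataset: five samples, where samples 3 and 4 belong to a rare class
    data = torch.arange(5).float().unsqueeze(1)
    labels = torch.tensor([0, 0, 0, 1, 1])
    dataset = TensorDataset(data, labels)

    # one weight per sample; up-weight the rare class so it is drawn more often
    weights = torch.tensor([0.1, 0.1, 0.1, 0.9, 0.9])
    sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)

    # the Sampler supplies the indices; the DataLoader fetches those samples from the Dataset
    loader = DataLoader(dataset, batch_size=2, sampler=sampler)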

Your custom Sampler would know about big images and sub-images.
It could still shuffle the order in which it selects big images and still
shuffle the sub-images within a big image, but when it reads in a big
image, it would then sample all of that big image’s sub-images before
moving on to the next big image, getting the most benefit out of the
expensive read operation.
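
Here is a rough sketch of such a Sampler, assuming one dataframe row per
sub-image and a column holding the big image's path (names are placeholders):

    import random

    from torch.utils.data import Sampler

    class BigImageGroupedSampler(Sampler):
        """Yields sub-image indices grouped by big image, shuffling the order
        of big images and the order of sub-images within each big image."""

        def __init__(self, final_cell_df, path_col):
            # map each big-image path -> list of row positions (the indices
            # that the Dataset's __getitem__ will receive)
            self.groups = {}
            for pos, path in enumerate(final_cell_df[path_col]):
                self.groups.setdefault(path, []).append(pos)

        def __len__(self):
            return sum(len(g) for g in self.groups.values())

        def __iter__(self):
            paths = list(self.groups)
            random.shuffle(paths)               # shuffle the order of big images
            for path in paths:
                indices = self.groups[path][:]
                random.shuffle(indices)         # shuffle the crops within one big image
                yield from indices

    # example usage, paired with a caching Dataset such as the Plan B sketch:
    #   dataset = CachedBigImageDataset(final_cell_df, marker, crop_size, cache_size=1)
    #   sampler = BigImageGroupedSampler(dataset.df, dataset.path_col)
    #   loader = DataLoader(dataset, batch_size=32, sampler=sampler)

Note that passing a sampler is mutually exclusive with shuffle = True in the
DataLoader, but the shuffling inside the Sampler gives you much the same
effect while still getting full use out of each big image while it is cached.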

Good luck.

K. Frank