I do deep learning on whole-slide pathology images (WSIs), so I am effectively restricted to the one package that can load these large images: OpenSlide. OpenSlide works by first opening a "header" (handle) for a WSI; I can then extract a patch (image) from the WSI through that handle. The problem is that every patch I extract is kept in OpenSlide's internal cache, which gradually fills up my RAM. This hasn't been an issue until now, when I am scaling up in a very large way.
My dataloader currently looks like this:
```python
class LibraryLoader(torch.utils.data.Dataset):
    def __init__(self, transform, lib):
        self.transform = transform
        # (1) read csv with patch locations and WSI IDs
        # (2) open all WSI (i.e. open all the headers up front)
        ...

    def __getitem__(self, index):
        # (a) look up the appropriate opened WSI for the given ID
        # (b) extract img (patch) from that WSI
        # (c) return img
        ...

    def __len__(self):
        return self.len
```
Opening all the WSIs up front speeds up my training; opening each WSI inside `__getitem__` before extracting every patch would be a lot slower.
Every 250 batches or so, I would like to reload all the WSIs, i.e. reopen all the headers so that the cache is freed. In other words, I would like to redo step (2).
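Roughly, I imagine something like the sketch below (the names are mine, not real code; `opener` stands in for `openslide.OpenSlide`, and in my real loader the handles would be keyed by WSI ID from the csv):

```python
# Sketch only: a container of slide handles that can be dropped and
# reopened on demand. `opener` stands in for openslide.OpenSlide; any
# callable returning an object with a .close() method works.
class ReopenableSlides:
    def __init__(self, slide_paths, opener):
        self.slide_paths = slide_paths
        self.opener = opener
        self.slides = {}
        self.reopen_all()

    def reopen_all(self):
        # Close the old handles first so their caches are released,
        # then open fresh headers, as in step (2) of __init__.
        for slide in self.slides.values():
            slide.close()
        self.slides = {path: self.opener(path) for path in self.slide_paths}
```

The training loop would then call it periodically, e.g. `if (i + 1) % 250 == 0: dataset.reopen_all()` — but I am not sure this is the right way to hook it up.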
What is a good way to do this?
Thanks in advance