Hi all,
I do deep learning on whole slide pathology images (WSI) and thus am restricted to using the only package that exists for loading these large images: openslide. Openslide works by first loading a “header” of a WSI. I can then extract a patch (image) from the WSI. Openslide has an issue where if I extract a patch from the WSI, it gets stored in cache and eventually fills up my RAM. This hasn’t been an issue until now when I am scaling up in a very large way.
My dataloader currently looks like this:
class LibraryLoader(torch.utils.data.Dataset):
def __init__(self, transform, lib):
(1) *read csv with patch locations and WSI IDs*
(2) *open all WSI*
def __getitem__(self, index):
(a) *call appropriate opened WSI from given ID*
(b) *extract img (patch) from WSI*
(c) return img
def __len__(self):
return self.len
loading all the WSI first speeds up my training. loading the WSI before extracting each patch inside getitem would be a lot slower.
Every 250 batches or so, I would like to reload all the WSI, in other words reopen all the headers freeing the cache. ie. I would like to do step (2)
What is a good way to do this?
Thanks in advance