"refreshing" dataloader

Hi all,

I do deep learning on whole slide pathology images (WSI) and am thus restricted to the only package that exists for loading these large images: OpenSlide. OpenSlide works by first loading a “header” of a WSI; I can then extract a patch (image) from the slide. OpenSlide has an issue where every patch I extract gets stored in its cache, which eventually fills up my RAM. This hasn’t been a problem until now, when I am scaling up in a very large way.
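For reference, the basic OpenSlide workflow looks roughly like this (a minimal sketch; the file path, patch coordinates and patch size are just placeholders):

import openslide

slide = openslide.OpenSlide("slide.svs")           # load the WSI "header"
patch = slide.read_region((0, 0), 0, (256, 256))   # extract a 256x256 patch at level 0 (PIL RGBA image)
patch = patch.convert("RGB")
slide.close()                                      # closing the slide releases its cache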

My dataloader currently looks like this:

class LibraryLoader(torch.utils.data.Dataset):
    def __init__(self, transform, lib):
        # (1) read csv with patch locations and WSI IDs
        # (2) open all WSI
        ...

    def __getitem__(self, index):
        # (a) look up the appropriate opened WSI from the given ID
        # (b) extract img (patch) from the WSI
        # (c) return img
        ...

    def __len__(self):
        return self.len

Loading all the WSI up front speeds up my training; opening a WSI before extracting each patch inside __getitem__ would be a lot slower.
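Concretely, the steps above look roughly like the sketch below (the CSV column names wsi_path, x and y and the fixed 256x256 patch at level 0 are placeholders, not my exact code):

import openslide
import pandas as pd
import torch

class LibraryLoader(torch.utils.data.Dataset):
    def __init__(self, transform, lib):
        self.transform = transform
        # (1) read csv with patch locations and WSI IDs
        self.df = pd.read_csv(lib)
        # (2) open all WSI once, keyed by slide path
        self.slides = {p: openslide.OpenSlide(p) for p in self.df["wsi_path"].unique()}
        self.len = len(self.df)

    def __getitem__(self, index):
        row = self.df.iloc[index]
        # (a) look up the already opened WSI for this patch
        slide = self.slides[row["wsi_path"]]
        # (b) extract the patch
        img = slide.read_region((int(row["x"]), int(row["y"])), 0, (256, 256)).convert("RGB")
        # (c) return the transformed patch
        return self.transform(img)

    def __len__(self):
        return self.len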

Every 250 batches or so, I would like to reload all the WSI, i.e. reopen all the headers and free the cache. In other words, I would like to redo step (2).

What is a good way to do this?

Thanks in advance

The easiest way would be to introduce a new variable in the dataset (sample_counter) and increment it in every call of __getitem__. If you additionally pass the batch size as an argument to __init__, you can do something like:

if not (self.sample_counter % (250 * self.batchsize)):
    reload_headers()
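A fuller sketch of that idea, reusing the dataset from above (reload_headers is just a hypothetical helper name, and the 250-batch interval and CSV columns are placeholders):

import openslide
import pandas as pd
import torch

class LibraryLoader(torch.utils.data.Dataset):
    def __init__(self, transform, lib, batchsize):
        self.transform = transform
        self.batchsize = batchsize
        self.sample_counter = 0
        self.df = pd.read_csv(lib)
        self.slides = {}
        self.reload_headers()
        self.len = len(self.df)

    def reload_headers(self):
        # close the old headers (dropping their caches) and reopen them
        for slide in self.slides.values():
            slide.close()
        self.slides = {p: openslide.OpenSlide(p) for p in self.df["wsi_path"].unique()}

    def __getitem__(self, index):
        self.sample_counter += 1
        # reopen all headers roughly every 250 batches
        if not (self.sample_counter % (250 * self.batchsize)):
            self.reload_headers()
        row = self.df.iloc[index]
        slide = self.slides[row["wsi_path"]]
        img = slide.read_region((int(row["x"]), int(row["y"])), 0, (256, 256)).convert("RGB")
        return self.transform(img)

    def __len__(self):
        return self.len

One caveat: if you use a DataLoader with num_workers > 0, each worker process holds its own copy of the dataset, so the counter and the reload happen independently per worker.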

Thanks for your reply @justusschock!

I used your idea and I think I got it working. Smooth training with no memory crashes yet.