Caching dataset pre-processing

Are there any solutions (or ideas) for how to cache datasets? I have quite a bit of pre-processing in the __getitem__ implementation of my torch.utils.data.Dataset, which is recomputed on every epoch.


It depends on what you do, really. The marshal module might help (it should be faster than pickle).
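As a minimal sketch of that idea: serialize the pre-processed result to disk with marshal on the first call, and deserialize it on later calls instead of recomputing. The function names (`preprocess`, `load_or_compute`) and the cache filename are hypothetical, not from the thread.

```python
import marshal
import os

def preprocess(values):
    # Stand-in for the expensive per-sample pre-processing.
    return [v * 2 for v in values]

def load_or_compute(values, cache_path='sample.marshal'):
    # On the first call the result is computed and dumped with marshal;
    # later calls deserialize the file instead of recomputing.
    # Note: marshal only handles simple built-in types (ints, lists, dicts, ...).
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return marshal.load(f)
    result = preprocess(values)
    with open(cache_path, 'wb') as f:
        marshal.dump(result, f)
    return result
```

In a real Dataset you would want one cache file per sample (e.g. keyed by index or input path), otherwise every call returns the first cached result.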

Or shelve, which may be even closer to what you’re looking for.
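A shelve-based version is naturally keyed, so each sample gets its own cache entry. This is a hedged sketch; `preprocess`, `get_sample`, and the database path are made-up names for illustration.

```python
import shelve

def preprocess(index):
    # Stand-in for the expensive per-sample pre-processing.
    return index * index

def get_sample(index, db_path='sample_cache'):
    # shelve keys must be strings; values are pickled transparently,
    # so arbitrary Python objects (dicts, arrays, ...) can be stored.
    with shelve.open(db_path) as db:
        key = str(index)
        if key not in db:
            db[key] = preprocess(index)
        return db[key]
```

Inside a Dataset, __getitem__ would simply return `get_sample(index)`.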

Is there maybe something like a cache proxy which just wraps a custom Python class and caches each call?

There’s functools.lru_cache in Python 3


Yep. That’s what I was looking for!


@bodokaiser Did you manage to use functools.lru_cache on __getitem__(self, index)? I’m having hard time making it work properly :frowning:

Hello EKami,
I ran into the same issue. It works if you define the function outside of the class and then call it from __getitem__ for pre-processing/loading.

joblib does an amazing job of caching to disk. It removes a lot of hassle and works great for me:

from joblib import Memory
from torch.utils.data import Dataset

cachedir = '/data/cache/'
memory = Memory(cachedir, verbose=0, compress=True)

@memory.cache
def preprocess_file(file, *params):
    # ... load data and do time-consuming pre-processing ...
    sample = {'data': data, 'labels': labels}
    return sample


class MyData(Dataset):
    def __init__(self, data_path, *parameters):
        self.files = []  # list of files

    def __getitem__(self, index):
        return preprocess_file(self.files[index], *parameters)

Is it possible to cache PyTorch models trained on a GPU with joblib?
I mean, using joblib to cache a trained model instead of dealing with the save/load hassle.