Caching dataset pre-processing

Are there any solutions (or ideas) for how to cache datasets? I have quite a bit of pre-processing in the __getitem__ implementation of my torch.utils.data.Dataset, which is recomputed on every epoch.


It depends on what you do, really. The marshal module might help (it should be faster than pickle).
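As a minimal sketch of that idea: serialize the pre-processed result to disk with marshal on the first call, and deserialize it on later calls instead of recomputing. The function names (`preprocess`, `load_or_compute`) and the cache filename are hypothetical, not from the thread.

```python
import marshal
import os

def preprocess(values):
    # Stand-in for the expensive per-sample pre-processing.
    return [v * 2 for v in values]

def load_or_compute(values, cache_path='sample.marshal'):
    # On the first call the result is computed and dumped with marshal;
    # later calls deserialize the file instead of recomputing.
    # Note: marshal only handles simple built-in types (ints, lists, dicts, ...).
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return marshal.load(f)
    result = preprocess(values)
    with open(cache_path, 'wb') as f:
        marshal.dump(result, f)
    return result
```

In a real Dataset you would want one cache file per sample (e.g. keyed by index or input path), otherwise every call returns the first cached result.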

Or shelve, which may be even closer to what you’re looking for.
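A shelve-based version is naturally keyed, so each sample gets its own cache entry. This is a hedged sketch; `preprocess`, `get_sample`, and the database path are made-up names for illustration.

```python
import shelve

def preprocess(index):
    # Stand-in for the expensive per-sample pre-processing.
    return index * index

def get_sample(index, db_path='sample_cache'):
    # shelve keys must be strings; values are pickled transparently,
    # so arbitrary Python objects (dicts, arrays, ...) can be stored.
    with shelve.open(db_path) as db:
        key = str(index)
        if key not in db:
            db[key] = preprocess(index)
        return db[key]
```

Inside a Dataset, __getitem__ would simply return `get_sample(index)`.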

Is there maybe something like a cache proxy which just wraps a custom Python class and caches each call?

There’s functools.lru_cache in Python 3


Yep. That’s what I was looking for!


@bodokaiser Did you manage to use functools.lru_cache on __getitem__(self, index)? I’m having hard time making it work properly :frowning:

Hello EKami,
I ran into the same issue. It works if you define the function outside of the class and then call it from __getitem__ for pre-processing/loading.

joblib does an amazing job of caching to disk. It removes a lot of hassle and works great for me:

from joblib import Memory
from torch.utils.data import Dataset

cachedir = '/data/cache/'
memory = Memory(cachedir, verbose=0, compress=True)

@memory.cache
def preprocess_file(file, *params):
    # ... load data and do time-consuming pre-processing ...
    sample = {'data': data, 'labels': labels}
    return sample


class MyData(Dataset):
    def __init__(self, data_path, *parameters):
        self.files = []  # list of files

    def __getitem__(self, index):
        return preprocess_file(self.files[index], *parameters)

Is it possible to cache PyTorch models trained on a GPU with joblib?
I mean, using joblib to cache a trained model instead of dealing with the save/load hassle.