Are there any solutions (or ideas) for how to cache datasets? I have quite a bit of pre-processing in the respective __getitem__ implementation of torch.utils.data.Dataset, which is recalculated on every epoch.
Depends on what you do really. The marshal module might help (should be faster than pickle).
Or shelve, which may be even closer to what you're looking for.
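For example, something along these lines (an untested sketch; the cache path and load_and_preprocess() are just placeholders for your own loading code):

import shelve
from torch.utils.data import Dataset

def load_and_preprocess(path):
    # stand-in for the real, time-consuming pre-processing
    with open(path, 'rb') as f:
        return f.read()

class ShelveCachedDataset(Dataset):
    def __init__(self, files, cache_path='/tmp/sample_cache'):
        self.files = files
        self.cache_path = cache_path

    def __getitem__(self, index):
        key = str(index)
        with shelve.open(self.cache_path) as cache:
            if key not in cache:
                # compute once, reuse on later epochs
                cache[key] = load_and_preprocess(self.files[index])
            return cache[key]

    def __len__(self):
        return len(self.files)

One caveat: a plain shelve file is generally not safe to write from several DataLoader workers at once, so this is mainly useful with num_workers=0 or if the cache is filled beforehand.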
Is there maybe something like a cache proxy which just wraps a custom Python class and caches each call?
There’s functools.lru_cache in Python 3
Yep. That’s what I was looking for!
@bodokaiser Did you manage to use functools.lru_cache on __getitem__(self, index)? I'm having a hard time making it work properly.
Hello EKami,
I ran into the same issue. It works if you define the preprocessing/loading function outside of the class and then call it from __getitem__, for example:
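Here is a rough sketch of that approach (load_and_preprocess() and the file handling are placeholders for your own code):

from functools import lru_cache
from torch.utils.data import Dataset

@lru_cache(maxsize=None)  # keep every distinct result in memory
def load_and_preprocess(path):
    # stand-in for the real, time-consuming pre-processing
    with open(path, 'rb') as f:
        return f.read()

class CachedDataset(Dataset):
    def __init__(self, files):
        self.files = files

    def __getitem__(self, index):
        # the expensive work runs once per path; later epochs hit the cache
        return load_and_preprocess(self.files[index])

    def __len__(self):
        return len(self.files)

Keep in mind that the cache lives in process memory, so with DataLoader(num_workers > 0) each worker process holds its own copy.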
joblib does an amazing job at caching to disk. This removes a lot of hassle and works great for me:
from joblib import Memory
from torch.utils.data import Dataset

cachedir = '/data/cache/'
memory = Memory(cachedir, verbose=0, compress=True)

@memory.cache
def preprocess_file(file, *params):
    # ... load the file and do the time-consuming pre-processing here,
    # producing `data` and `labels` ...
    sample = {}
    sample['data'] = data
    sample['labels'] = labels
    return sample

class MyData(Dataset):
    def __init__(self, data_path, *parameters):
        self.files = []  # list of files found under data_path
        self.parameters = parameters

    def __getitem__(self, index):
        sample = preprocess_file(self.files[index], *self.parameters)
        return sample

    def __len__(self):
        return len(self.files)
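As far as I know, joblib's Memory.cache derives the cache key from a hash of the function's input arguments (and re-runs the function if its source code changes), so changing the file or the pre-processing parameters automatically invalidates the cached entry.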
Is it possible to cache PyTorch models trained on a GPU with joblib?
I mean, using joblib to cache a trained model instead of dealing with the saving/loading hassle.