Are there any solutions (or ideas) for caching datasets? I have quite a bit of pre-processing in the respective __getitem__
implementation of torch.utils.data.Dataset,
which is recalculated on every epoch.
Depends on what you do, really. The marshal
module might help (it should be faster than pickle).
Or shelve,
which may be even closer to what you're looking for; a rough sketch is below.
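A rough sketch of the shelve idea, just to show the shape (ShelveCachedDataset, the cache_path default, and load_and_preprocess are placeholder names, not from an actual implementation):

import shelve
from torch.utils.data import Dataset

def load_and_preprocess(index):
    ...  # stand-in for the expensive per-sample loading / pre-processing

class ShelveCachedDataset(Dataset):
    def __init__(self, cache_path='/tmp/sample_cache'):
        # shelve keeps pickled values in a small on-disk database;
        # keys must be strings, so the sample index is stringified
        self.cache = shelve.open(cache_path)

    def __getitem__(self, index):
        key = str(index)
        if key not in self.cache:
            self.cache[key] = load_and_preprocess(index)
        return self.cache[key]

The first epoch pays the pre-processing cost; later epochs read the pickled samples back from disk.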
Is there maybe something like a cache proxy that just wraps a custom Python class and caches each call?
There’s functools.lru_cache
in Python 3
Yep. That’s what I was looking for!
@bodokaiser Did you manage to use functools.lru_cache
on __getitem__(self, index)
? I'm having a hard time making it work properly.
Hello EKami
I ran into the same issue. It works if you define the loading/pre-processing function outside of the class and then call it from __getitem__, as in the sketch below.
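Something along these lines (a minimal sketch; load_sample and CachedDataset are just stand-ins for your own loading/pre-processing and Dataset class):

from functools import lru_cache
from torch.utils.data import Dataset

# the cached function lives at module level, outside the Dataset class
@lru_cache(maxsize=None)
def load_sample(path):
    # stand-in for the expensive loading / pre-processing of one file
    with open(path, 'rb') as f:
        return {'data': f.read()}

class CachedDataset(Dataset):
    def __init__(self, files):
        self.files = files

    def __len__(self):
        return len(self.files)

    def __getitem__(self, index):
        # repeat epochs hit the in-memory cache instead of re-loading
        return load_sample(self.files[index])

Note that lru_cache keeps everything in memory and is per process, so with DataLoader workers each worker ends up with its own copy of the cache.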
joblib does an amazing job of caching to disk. This removes a lot of hassle and works great for me:
from joblib import Memory
from torch.utils.data import Dataset

cachedir = '/data/cache/'
memory = Memory(cachedir, verbose=0, compress=True)

@memory.cache
def preprocess_file(file, *params):
    # ... load the data and do the time-consuming pre-processing ...
    sample = {}
    sample['data'] = data      # data/labels come from the elided loading step above
    sample['labels'] = labels
    return sample

class MyData(Dataset):
    def __init__(self, data_path, *parameters):
        self.files = ...  # list of files
        self.parameters = parameters

    def __getitem__(self, index):
        sample = preprocess_file(self.files[index], *self.parameters)
        return sample
Is it possible to cache PyTorch models trained on a GPU with joblib?
I mean, using joblib to cache a trained model instead of dealing with the usual saving/loading hassle.