Caching with Dataset

First, my dataset class does not modify the data it loads (from HDF files, in this case). The machines I run my models on do not have enough RAM to hold all of the dataset items at once.

To speed up loading, I have been caching items up to a fixed count at startup. Then, in __getitem__, it checks whether the item is cached and returns it if so, or otherwise loads the item from disk. However, this means it takes 10+ minutes at the start to load everything the cache can hold. I would prefer to load the data lazily as the dataloader calls into the dataset.
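For reference, this is roughly what that looks like (simplified, and the names here are stand-ins, including the “data” key inside the HDF file):

import h5py
from torch.utils.data import Dataset

class H5Dataset(Dataset):
    def __init__(self, path, cache_count):
        self.path = path
        with h5py.File(path, "r") as f:
            self.total = len(f["data"])
        self.count = min(cache_count, self.total)
        # eager pre-fill: this is the 10+ minute step at startup
        self.cache = [self._load(i) for i in range(self.count)]

    def _load(self, index):
        # read a single sample out of the HDF file
        with h5py.File(self.path, "r") as f:
            return f["data"][index]

    def __getitem__(self, index):
        if index < self.count:
            return self.cache[index]    # cached at startup
        return self._load(index)        # everything else comes from disk

    def __len__(self):
        return self.total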

My first attempt was simply to have the workers update the cache list themselves, but this failed: each DataLoader worker is a separate process with its own copy of the dataset object, so updates to a plain Python list in one worker are never seen by the others.

My second attempt, which I got from here, was to use a multiprocessing Manager list:

from multiprocessing import Manager
# in __init__: a list proxy shared by all workers via the manager process
self.manager = Manager()
self.cache = self.manager.list([None] * self.count)
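The lazy lookup in __getitem__ then becomes something like this (same stand-in names as above):

def __getitem__(self, index):
    sample = self.cache[index]      # a round trip to the manager process
    if sample is None:
        sample = self._load(index)  # cache miss: read from the HDF file
        self.cache[index] = sample  # publish it through the shared list
    return sample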

And this seemed to work in simple tests: artificially limiting the number of items, I saw that it lazily cached everything it could and read from the cache on subsequent epochs. However, when I let it run on the full set, it failed before finishing all of the batches of the first epoch.

I expect this is because two workers sometimes try to update the shared list at the same time. The error raised mentions “convert_to_error”, on the line where the item loaded from disk is stored into the list:

self.cache[index] = sample

I’ve searched here and in the multiprocessing documentation, but I can’t find a way to put some sort of lock around this line, so that only one worker writes to the cache at a time.
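What I picture is something along these lines, with a lock created by the same manager so it can be shared across workers, but I don’t know whether this is the right mechanism or whether it would actually prevent the error:

# in __init__, next to the manager list
self.lock = self.manager.Lock()

# in __getitem__, around the line that fails
with self.lock:
    self.cache[index] = sample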