Use a huge file with pre-calculated tensors for training?

josmi9966 · November 13, 2017, 8:51pm

I want to train a net using a training set that does not fit into memory. However, to implement
a Dataset subclass, it must be possible to access an arbitrary element of the training set directly.

I have seen mention of torch.FloatStorage, which apparently allows memory-mapping a large number of tensors.

Could this be used to solve my problem? Ideally I would preprocess the huge dataset in its original format
by going through it several times, and on the final pass create the input/output tensors as I need them.
Since each example is represented by the same constant number k of elements, I should then be able to
append those k values to the tensor Storage incrementally, growing the memory-mapped FloatStorage … somehow.
Finally, for training, I could wrap the FloatStorage in a Dataset subclass.
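
To make the wrapping idea concrete, here is a rough sketch of the random-access part. It is not torch.FloatStorage itself: it assumes the examples sit back to back as raw float32 values in one flat binary file, and it uses only the Python standard library. The class name `FlatFloatFile` and the file layout are assumptions made for illustration. It provides `__len__` and `__getitem__`, which are the two methods a map-style torch.utils.data.Dataset subclass has to implement, so the same logic could be moved into such a subclass and the returned list converted to a tensor.

```python
import mmap
import os

FLOAT_SIZE = 4  # bytes per float32 value (assumed file format)


class FlatFloatFile:
    """Random access to fixed-size float32 examples in a flat binary file.

    Hypothetical helper: example i occupies values [i*k, (i+1)*k) of the file.
    """

    def __init__(self, path, k):
        self.k = k
        n_bytes = os.path.getsize(path)
        if n_bytes == 0 or n_bytes % (k * FLOAT_SIZE) != 0:
            raise ValueError("file size must be a positive multiple of k float32 values")
        self._n = n_bytes // (k * FLOAT_SIZE)
        self._file = open(path, "rb")
        # Memory-map the whole file read-only; the OS pages data in on demand,
        # so the file is never loaded into memory in full.
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        # View the mapped bytes as native 4-byte floats without copying.
        self._view = memoryview(self._mm).cast("f")

    def __len__(self):
        return self._n

    def __getitem__(self, i):
        if not 0 <= i < self._n:
            raise IndexError(i)
        start = i * self.k
        # Only the k values of this one example are copied out of the mapping.
        return self._view[start:start + self.k].tolist()

    def close(self):
        # Release the view before closing the mapping it points into.
        self._view.release()
        self._mm.close()
        self._file.close()
```

Because every example has the same length k, the position of example i follows from arithmetic alone, so no index file is needed.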

However, I do not really understand how I would be able to create the FloatStorage initially and incrementally
add the tensor elements … is there some example code for this? Is it possible to allocate some initial size for the
Storage and then extend it if necessary?
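
One way to sidestep the question of an initial size, sketched here under the assumption that the data lives in a plain binary file rather than in a FloatStorage object: a file opened in append mode grows by itself with every write, so nothing has to be preallocated or resized. The helper names `append_example` and `count_examples` are made up for this illustration.

```python
import array
import os

FLOAT_ITEMSIZE = array.array("f").itemsize  # 4 bytes on common platforms


def append_example(f, values, k):
    """Append one example of exactly k float32 values to an open binary file."""
    if len(values) != k:
        raise ValueError(f"expected {k} values, got {len(values)}")
    # array("f") converts the Python floats to native float32 before writing.
    array.array("f", values).tofile(f)


def count_examples(path, k):
    """Number of complete examples currently stored in the file."""
    return os.path.getsize(path) // (k * FLOAT_ITEMSIZE)
```

With the file opened as `open(path, "ab")`, repeated calls to `append_example` extend it across as many preprocessing runs as needed, and the finished file can be memory-mapped read-only for training.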

Is there a standard way to do this, or some example code out there?