If you have an application where the dataset is potentially huge but is generated in a sequential/streaming manner, is there a way to save the tensor outputs in a streaming manner as well?

For example, if we want to preprocess a dataset by tokenizing it, running it through a BERT transformer, and then saving the hidden states output by each layer, is there an efficient way to do this?

For instance:

```
import torch

# Say batch_size = 50, num_layers = 13 (for BERT), and input length T = 200.
# If the dataset has 20000 items, then per batch we produce:
HDim = 768  # hidden size for BERT-base
a = torch.randn(13, 50, 200, HDim)
# ... and we produce this tensor 20000 / 50 = 400 times!
# If we concatenate it all, it becomes massive -- even as a numpy array!
```
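To make "massive" concrete, here is the back-of-envelope arithmetic, assuming float32 storage and a BERT-base hidden size of 768 (my assumption for `HDim`):

```python
# Total size of the concatenated hidden-state array, assuming float32 (4 bytes)
n_batches = 20000 // 50              # 400 batches
num_layers, batch_size, seq_len, hidden_dim = 13, 50, 200, 768
total_bytes = n_batches * num_layers * batch_size * seq_len * hidden_dim * 4
print(f"{total_bytes / 1e9:.0f} GB")  # roughly 160 GB
```

So holding the concatenated result in memory is clearly not an option.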

Is there a way to write out `a` in a streaming fashion, using HDF5/Zarr/mmap/PyTables or anything similar?
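For context, this is roughly the kind of write pattern I have in mind, sketched with a memory-mapped `.npy` file (one of the options above). Shapes are scaled down for illustration; in the real setting each batch would be `(13, 50, 200, 768)`, and `n_batches` would be `20000 // 50`:

```python
import numpy as np

# Scaled-down stand-ins for the real dimensions
num_layers, batch_size, seq_len, hidden_dim = 13, 4, 16, 8
n_batches = 5

# Preallocate the full array on disk; only one batch is ever in RAM.
out = np.lib.format.open_memmap(
    "hidden_states.npy", mode="w+", dtype=np.float32,
    shape=(n_batches, num_layers, batch_size, seq_len, hidden_dim),
)

for i in range(n_batches):
    # Stand-in for one batch of BERT hidden states
    batch = np.random.randn(num_layers, batch_size, seq_len, hidden_dim)
    out[i] = batch.astype(np.float32)  # written through to disk
out.flush()
```

The file can later be read back lazily with `np.load("hidden_states.npy", mmap_mode="r")`, but this requires knowing the total number of batches up front, which is why I'm wondering whether HDF5/Zarr-style appendable datasets are the better fit.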