If you have an application where the dataset is potentially huge but is generated in a sequential/streaming manner, is there a way to save the tensor outputs in a streaming manner as well?

For example, if we want to preprocess a dataset by tokenizing it, running it through a BERT transformer, and then saving the hidden states output by each layer, is there an efficient way to do this?

For instance:

```
import torch

# Say batch_size = 50, num_layers = 13 (for BERT), and input length T = 200.
# If the dataset has 20000 items, then per batch we produce:
HDim = 768  # hidden size for BERT-base
a = torch.randn(13, 50, 200, HDim)
# ... and we produce this tensor 20000 / 50 = 400 times!
# If we concatenate it all, it becomes massive -- even as a numpy array!
```
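To make "massive" concrete, here is the back-of-envelope arithmetic, assuming float32 storage and a BERT-base hidden size of 768 (my assumption for `HDim`):

```python
# Total size of the concatenated hidden-state array, assuming float32 (4 bytes)
n_batches = 20000 // 50              # 400 batches
num_layers, batch_size, seq_len, hidden_dim = 13, 50, 200, 768
total_bytes = n_batches * num_layers * batch_size * seq_len * hidden_dim * 4
print(f"{total_bytes / 1e9:.0f} GB")  # roughly 160 GB
```

So holding the concatenated result in memory is clearly not an option.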

Is there a way to write out `a` in a streaming fashion, using HDF5/Zarr/mmap/PyTables or anything similar?
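For context, this is roughly the kind of write pattern I have in mind, sketched with a memory-mapped `.npy` file (one of the options above). Shapes are scaled down for illustration; in the real setting each batch would be `(13, 50, 200, 768)`, and `n_batches` would be `20000 // 50`:

```python
import numpy as np

# Scaled-down stand-ins for the real dimensions
num_layers, batch_size, seq_len, hidden_dim = 13, 4, 16, 8
n_batches = 5

# Preallocate the full array on disk; only one batch is ever in RAM.
out = np.lib.format.open_memmap(
    "hidden_states.npy", mode="w+", dtype=np.float32,
    shape=(n_batches, num_layers, batch_size, seq_len, hidden_dim),
)

for i in range(n_batches):
    # Stand-in for one batch of BERT hidden states
    batch = np.random.randn(num_layers, batch_size, seq_len, hidden_dim)
    out[i] = batch.astype(np.float32)  # written through to disk
out.flush()
```

The file can later be read back lazily with `np.load("hidden_states.npy", mmap_mode="r")`, but this requires knowing the total number of batches up front, which is why I'm wondering whether HDF5/Zarr-style appendable datasets are the better fit.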