If you have an application with a potentially huge dataset that is generated in a sequential/streaming manner, is there a way to save the tensor outputs in a streaming manner as well?
For example, if we want to preprocess a dataset by tokenizing it and running it through a BERT transformer, and then save the hidden-state output from each layer, is there an efficient way to do this?
```python
# Say batch_size = 50 and num_layers = 13 (for BERT),
# the input length T = 200, and the dataset has 20000 items.
# Then each batch produces:
a = torch.randn(13, 50, 200, HDim)
# ... and we get this array 20000/50 = 400 times!
# Concatenating them all becomes massive, even as numpy arrays!
```
Is there a way to write out `a` in a streaming way, using HDF5/Zarr/mmap/PyTables or anything else?
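For reference, here is one possible sketch of what I mean by a streaming write, using `numpy.memmap` (the dimensions are scaled down from the numbers above so it runs quickly, and the filename `hidden_states.dat` is just a placeholder; the random batch stands in for the real BERT forward pass):

```python
import numpy as np

# Scaled-down dimensions for illustration (the real case would be
# N=20000, L=13, T=200, H=hidden size).
N, L, T, H = 200, 13, 32, 16   # items, layers, seq length, hidden dim
B = 50                          # batch size

# Disk-backed array; only one batch is ever held in RAM.
out = np.memmap("hidden_states.dat", dtype="float32", mode="w+",
                shape=(N, L, T, H))

for start in range(0, N, B):
    # In the real pipeline this would be the model output,
    # shaped (num_layers, batch, T, H):
    a = np.random.randn(L, B, T, H).astype("float32")
    # Move the batch axis first so items are contiguous on disk.
    out[start:start + B] = a.transpose(1, 0, 2, 3)

out.flush()
```

The downside is that `memmap` needs the total shape up front; HDF5 (via `h5py` with a resizable dataset) or Zarr would allow appending without knowing the final item count in advance.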