I have ~0.5 million files from which I extract features as torch.Tensors with dimensions of approximately 100x100x100. Precomputing the features for all files doesn't seem like an option because I don't have enough storage space. On the other hand, computing the features takes ~10 s per file, so I'm worried this will become a bottleneck during training if I don't precompute them.
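For reference, this is roughly what my current on-the-fly setup looks like: the features are computed inside `__getitem__`, so `DataLoader` workers can overlap extraction with training. `extract_features` here is just a stand-in for my real pipeline (it returns a small dummy tensor so the snippet runs):

```python
import torch
from torch.utils.data import Dataset, DataLoader


def extract_features(path):
    # Stand-in for the real ~10 s feature extraction per file;
    # here it just returns a small random tensor.
    return torch.rand(4, 4, 4)


class OnTheFlyDataset(Dataset):
    """Computes features at load time instead of precomputing them."""

    def __init__(self, file_paths):
        self.file_paths = file_paths

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        # Expensive step happens here, once per sample per epoch.
        return extract_features(self.file_paths[idx])


paths = [f"file_{i}.bin" for i in range(8)]  # hypothetical file list
# In practice I'd set num_workers > 0 to parallelize the extraction.
loader = DataLoader(OnTheFlyDataset(paths), batch_size=2, num_workers=0)
batch = next(iter(loader))
print(batch.shape)  # torch.Size([2, 4, 4, 4])
```

With the real extraction cost, even many workers may not keep the GPU fed, which is why I'm asking about alternatives.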
I would be grateful if anyone could suggest the standard/best practice in this case, or point me to a relevant source.
Thank you!