Help building and storing very large datasets?

Given an embedding model, I want to embed a large corpus of text and store the embeddings for later downstream tasks. What’s the best way to do this? Both the embedding model and the dataset are large, so neither fits in CPU or GPU memory.

How can I create a PyTorch dataset, save it to disk, and then continuously append chunks of torch.Tensor embeddings to it without having to load the whole dataset back into memory?
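For the append step, one option I’ve been considering is to skip torch.save entirely and stream raw float32 bytes into a single flat binary file. A sketch of what I mean is below; `model`, `text_loader`, and the output path are just placeholders for my actual embedding model, data loader, and file:

```python
import torch

EMBED_DIM = 4096
out_path = "embeddings.f32.bin"  # one flat binary file of float32 values

model.eval()
with open(out_path, "ab") as f, torch.no_grad():  # append-binary: nothing is ever read back
    for batch in text_loader:                     # placeholder loader over the raw text
        emb = model(batch)                        # shape: (batch_size, EMBED_DIM)
        emb = emb.to("cpu", torch.float32)        # move off the GPU before writing
        f.write(emb.numpy().tobytes())            # append raw float32 bytes to the file
```

Is something like this reasonable, or is there a more idiomatic way?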

I also don’t want to end up with hundreds of .pt files; I just want a single large dataset on disk. Each embedding is 4096 floats × 4 bytes (float32) ≈ 16 KB, and with on the order of 10,000,000 items I expect the dataset to be about 200 GB in size.

Furthermore, after creating this dataset I’d need a good way to read chunks back out of it. Could anyone point me to an elegant way of doing this?
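The rough idea I had for the read side is to wrap the same flat file in a memory-mapped Dataset so a DataLoader can pull batches without ever materialising the whole 200 GB. Again just a sketch; the class name and file name are my own placeholders:

```python
import os

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class EmbeddingFileDataset(Dataset):
    """Lazily reads rows from the flat float32 file written above."""

    def __init__(self, path, embed_dim=4096):
        self.path = path
        self.embed_dim = embed_dim
        # Row count from the file size; the memmap itself is opened lazily per worker.
        self.length = os.path.getsize(path) // (embed_dim * 4)  # 4 bytes per float32
        self.data = None

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self.data is None:  # opened once in each worker process
            self.data = np.memmap(self.path, dtype=np.float32,
                                  mode="r").reshape(-1, self.embed_dim)
        # copy() detaches the row from the memmap before handing it to torch
        return torch.from_numpy(self.data[idx].copy())


loader = DataLoader(EmbeddingFileDataset("embeddings.f32.bin"),
                    batch_size=1024, shuffle=True, num_workers=4)
```

Would np.memmap be the right tool here, or is something like HDF5 a better fit?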

@ptrblck do you have any advice?