I need to encode a large file (>10M sentences) with BERT to get sentence embeddings, save them to a file, and then load them into a `torch.utils.data.Dataset` that allows indexing in the training framework for my model.
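Roughly what my encoding step looks like at the moment is sketched below (the model name, pooling choice, batch size, and file paths are just placeholders); accumulating every batch's output on the GPU is where memory runs out:

```python
import pickle
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").cuda().eval()

def batches(path, batch_size=256):
    """Yield lists of sentences read line by line from the input file."""
    batch = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            batch.append(line.strip())
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch

embeddings = []
with torch.no_grad():
    for batch in batches("sentences.txt"):
        inputs = tokenizer(batch, padding=True, truncation=True,
                           return_tensors="pt").to("cuda")
        # use the [CLS] token as the sentence embedding
        out = model(**inputs).last_hidden_state[:, 0, :]
        embeddings.append(out)  # every batch stays on the GPU, so memory keeps growing

# one massive (n_sentences, 768) tensor that then gets pickled
all_embeddings = torch.cat(embeddings).cpu()
with open("embeddings.pkl", "wb") as f:
    pickle.dump(all_embeddings, f)
```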
I’m running into trouble in two places:
- I run out of GPU memory when encoding the file with BERT to create one massive tensor that then gets saved to a pickle (roughly the sketch above).
- I’m not sure how to load data from a `.pkl` file into a `TensorDataset`, ideally using a mechanism that avoids loading the entire `.pkl` into memory to create the `Dataset` (the naive version I'd like to avoid is sketched below).
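For the second point, the only loading pattern I know is the naive one below, where the whole pickle is deserialized into RAM before the `TensorDataset` is built (file name and batch size are placeholders); with ~10M sentences of 768-dim float embeddings that is on the order of tens of GB:

```python
import pickle
import torch
from torch.utils.data import DataLoader, TensorDataset

# naive version: the entire pickle ends up in memory in one go
with open("embeddings.pkl", "rb") as f:
    all_embeddings = pickle.load(f)

dataset = TensorDataset(all_embeddings)   # indexable, but fully resident in RAM
loader = DataLoader(dataset, batch_size=64, shuffle=True)
```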
Is there anything I can do better? Alternative ways of saving the tensors, a different `Dataset` class? Thanks!