I need to encode a large file (>10M sentences) with BERT to get sentence embeddings, save them to a file, and then load them into a `torch.utils.data.Dataset` that allows indexing in the training framework for my model.
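Roughly what my encoding step looks like at the moment is sketched below (the model name, pooling choice, batch size, and file paths are just placeholders); accumulating every batch's output on the GPU is where memory runs out:

```python
import pickle
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").cuda().eval()

def batches(path, batch_size=256):
    """Yield lists of sentences read line by line from the input file."""
    batch = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            batch.append(line.strip())
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch

embeddings = []
with torch.no_grad():
    for batch in batches("sentences.txt"):
        inputs = tokenizer(batch, padding=True, truncation=True,
                           return_tensors="pt").to("cuda")
        # use the [CLS] token as the sentence embedding
        out = model(**inputs).last_hidden_state[:, 0, :]
        embeddings.append(out)  # every batch stays on the GPU, so memory keeps growing

# one massive (n_sentences, 768) tensor that then gets pickled
all_embeddings = torch.cat(embeddings).cpu()
with open("embeddings.pkl", "wb") as f:
    pickle.dump(all_embeddings, f)
```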
I’m running into trouble in two places:
- I run out of GPU memory when encoding the file with BERT to create one massive tensor that then gets saved to a pickle (roughly the sketch above).
- I’m not sure how to load data from a `.pkl` file into a `TensorDataset`, ideally using a mechanism that avoids loading the entire `.pkl` into memory to create the `Dataset` (the naive version I'd like to avoid is sketched below).
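For the second point, the only loading pattern I know is the naive one below, where the whole pickle is deserialized into RAM before the `TensorDataset` is built (file name and batch size are placeholders); with ~10M sentences of 768-dim float embeddings that is on the order of tens of GB:

```python
import pickle
import torch
from torch.utils.data import DataLoader, TensorDataset

# naive version: the entire pickle ends up in memory in one go
with open("embeddings.pkl", "rb") as f:
    all_embeddings = pickle.load(f)

dataset = TensorDataset(all_embeddings)   # indexable, but fully resident in RAM
loader = DataLoader(dataset, batch_size=64, shuffle=True)
```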
Is there anything I can do better? Alternative ways of saving the tensors, a different `Dataset` class? Thanks!