Suppose I’m using the Dataset and DataLoader classes to load vectors (e.g. sentence embeddings produced by another model) on the fly, meaning I only load one batch (e.g. batch_size=16 embeddings) at a time while training my model. I might want to do this if I have too many embeddings to fit into memory.
What is the best format to save the sentence embeddings in order to speed up the on-the-fly loading and hence training? What is the best way to load them?
The embeddings are 1024-dimensional and can be converted to numpy arrays or torch tensors. I think the values are quite random, since they’re latent vectors from another model.
I think the best way would be to use torch.save and torch.load.
But since disk I/O is very slow, I think you should avoid loading batches one at a time.
Even if your entire dataset cannot fit into memory, I’m sure you can fit more than only one batch.
Here is how I do it in my Dataset class:
On disk I have several files containing my training examples. Each file holds 250 000 examples as a single tensor of dims (250000,firstDim,secondDim…), saved using torch.save.
My Dataset loads only one of these files at a time using torch.load. When it is asked for a batch, I simply select the batch from the giant tensor using tensor.narrow and clone the result into GPU memory (to keep GPU memory usage minimal, only one batch is on the GPU at a time). When the dataloader runs out of batches to select from the giant tensor, the Dataset simply loads a different file.
I found that this method is really efficient because I/O is minimal, and you can really tune it depending on how much RAM you want to use.
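For reference, the chunked loading described above could be sketched roughly as follows. This is only a minimal sketch under some assumptions, not the poster's actual code: the class name ChunkedEmbeddings and its parameters are made up for illustration, and each file is assumed to hold a single tensor written with torch.save.

```python
import random

import torch
from torch.utils.data import IterableDataset


class ChunkedEmbeddings(IterableDataset):
    """Yields batches from chunk files, keeping one chunk in RAM at a time.

    Hypothetical helper: every path is assumed to point to one tensor of
    shape (num_examples, ...) saved with torch.save.
    """

    def __init__(self, paths, batch_size, device="cpu", shuffle_files=True):
        self.paths = list(paths)
        self.batch_size = batch_size
        self.device = device
        self.shuffle_files = shuffle_files

    def __iter__(self):
        order = list(self.paths)
        if self.shuffle_files:
            random.shuffle(order)  # different file order on every pass
        for path in order:
            chunk = torch.load(path)  # one disk read per chunk
            for start in range(0, chunk.size(0), self.batch_size):
                length = min(self.batch_size, chunk.size(0) - start)
                # narrow() returns a view of the chunk; .to() then moves only
                # this batch to the target device (no copy if already there).
                yield chunk.narrow(0, start, length).to(self.device)
```

Since the dataset already yields whole batches, it can be iterated directly or wrapped as DataLoader(dataset, batch_size=None). With num_workers > 0 every worker would iterate over all files, so the paths would need to be split between workers first.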
This makes sense except you can never get fully random data shuffling for each epoch, no?
Yes, exactly. What I do is shuffle the file load order between epochs, and in my case that’s good enough. If you want something closer to fully random data shuffling, you could shuffle the big tensor right after you load it from the file (following “Shuffling a Tensor”). I think that should be sufficient.
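Shuffling the loaded tensor along its first dimension can be done with torch.randperm. A minimal sketch (the function name shuffle_rows is just for illustration):

```python
import torch


def shuffle_rows(chunk, generator=None):
    """Return a copy of `chunk` with its rows (dim 0) in random order."""
    perm = torch.randperm(chunk.size(0), generator=generator)
    # Indexing with a permutation creates a new tensor, so the chunk is
    # briefly held in memory twice.
    return chunk[perm]
```

Combined with shuffling the file order, this randomizes examples within each file, but examples from different files still never end up in the same batch.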
Yeah, that seems reasonable. Do you have a rough sense of the relative loading time overhead of these 3 methods?
1. Loading batch by batch
2. Loading groups of batches
3. Loading the entire dataset at once (if it hypothetically did fit in memory)
I guess I’m thinking in terms of X times slowdown for 1 and 2 versus 3. Maybe just some rough guesses would be nice.
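One way to get rough numbers on your own hardware is to time the three strategies on synthetic data. The following is only a sketch under stated assumptions: the function names, the default sizes, and the file layout are made up for illustration, and because the files are read right after being written they will likely come from the OS page cache, so real cold-disk reads may be considerably slower.

```python
import os
import tempfile
import time

import torch


def _save_pieces(data, rows_per_file, folder, prefix):
    """Split `data` along dim 0 and save each piece to its own file."""
    paths = []
    for i, start in enumerate(range(0, data.size(0), rows_per_file)):
        path = os.path.join(folder, f"{prefix}_{i}.pt")
        # clone() so each file stores only its own rows
        torch.save(data[start:start + rows_per_file].clone(), path)
        paths.append(path)
    return paths


def _time_batches(paths, batch_size):
    """Load every file and slice it into batches; return (seconds, rows seen)."""
    begin = time.perf_counter()
    seen = 0
    for path in paths:
        piece = torch.load(path)
        for start in range(0, piece.size(0), batch_size):
            batch = piece[start:start + batch_size]
            seen += batch.size(0)
    return time.perf_counter() - begin, seen


def time_strategies(num_rows=4096, dim=1024, batch_size=16, chunk_rows=1024):
    """Time loading per batch, per chunk, and as one file (synthetic data)."""
    data = torch.randn(num_rows, dim)
    results = {}
    layouts = [
        ("per_batch", batch_size),  # 1. one file per batch
        ("per_chunk", chunk_rows),  # 2. one file per group of batches
        ("whole", num_rows),        # 3. the entire dataset in one file
    ]
    with tempfile.TemporaryDirectory() as tmp:
        for name, rows_per_file in layouts:
            paths = _save_pieces(data, rows_per_file, tmp, name)
            seconds, seen = _time_batches(paths, batch_size)
            assert seen == num_rows  # every row was visited exactly once
            results[name] = seconds
    return results
```

Calling time_strategies() returns a dict of elapsed seconds per strategy; dividing the first two values by the "whole" value gives the X-times slowdown asked about, for that machine and those sizes only.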