I am trying to create a model that understands patterns in the human voice, and I have a lot of voice samples (133K files, about 40 GB overall). I run a lot of preprocessing and then generate a feature cube that I want to feed to a PyTorch model.
So far I have been doing the preprocessing and cube generation offline: I create the feature cubes and write them to a *.pt file using torch.save().
I have currently only used 5K samples, which generated a *.pt file of about 1 GB (so I would expect roughly 26 GB for the full dataset).
I then call torch.load() on the training host, loading everything into memory with TensorDataset and DataLoader:
```python
features = torch.load(featpath)
labels = torch.load(labpath)
dataset = torch.utils.data.TensorDataset(features, labels)
```
It worked OK with fewer samples, and it will probably keep working if I provide enough memory / disk space (I use S3 and SageMaker).
I am just wondering: am I doing the right thing? Are there best practices for large datasets? Streaming from disk? Or do they all have to be loaded into memory? I am assuming here that Dataset/DataLoader handle this.
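For context, the in-memory pipeline described above can be sketched end-to-end. The tensor shapes here are made up stand-ins for the real feature cubes (the originals would come from torch.load(featpath) and torch.load(labpath)):

```python
import torch

# stand-in tensors: 5K samples, each a hypothetical 40x100 feature cube
features = torch.randn(5000, 40, 100)
labels = torch.randint(0, 10, (5000,))

# everything lives in memory; TensorDataset just indexes into the tensors
dataset = torch.utils.data.TensorDataset(features, labels)
loader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True)

batch_feats, batch_labels = next(iter(loader))
print(batch_feats.shape)  # torch.Size([64, 40, 100])
```

Note that this holds the full feature tensor in RAM for the lifetime of the dataset, which is exactly the constraint the question is about.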
As long as you have enough memory to do so, there's no problem with what you are doing. If the data were to get too large, the easiest solution would be to split the samples into individual files and load them lazily in a custom Dataset.
Thanks, just to clarify:
Will __getitem__ have to flip from one file to the other when one has finished its cycle? This would mean that for every epoch I would be flipping from one file to another, and back to the first at the beginning of the next epoch?
It would be nice if there were an example, but this seems reasonable in case I run out of memory. I still do not have a good handle on Datasets.
Thank you again.
Well, to avoid unnecessary pain, you would split the data so that each sample tensor has its own .pt file. Then, in __getitem__, you call torch.load, like this:
```python
import os
import torch

class VoiceDataset(torch.utils.data.Dataset):
    def __init__(self, root):
        self.root = root
        self.files = os.listdir(root)  # take all files in the root directory

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # each file holds a (sample, label) tuple for one sample
        sample, label = torch.load(os.path.join(self.root, self.files[idx]))
        return sample, label
```
And then use this dataset in the same way. Keeping the data in several chunks rather than as individual samples would be … problematic.
Thanks, hopefully it will fit in memory, but this seems a pretty simple solution.
Would there be any issue with shuffling and the DataLoader? I wonder how shuffling would work in this case.
With shuffling enabled, the DataLoader randomizes the idx argument passed to __getitem__, effectively choosing a random file each time.
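This is easy to see with a toy Dataset (purely illustrative) whose __getitem__ simply returns the index it was given — the shuffled DataLoader then yields a fresh permutation of the indices each epoch:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class IndexDataset(Dataset):
    # toy dataset: "loading" a sample just returns its index
    def __len__(self):
        return 10

    def __getitem__(self, idx):
        return idx

loader = DataLoader(IndexDataset(), batch_size=10, shuffle=True)
indices = next(iter(loader)).tolist()
print(indices)          # a random permutation of 0..9, different each epoch
print(sorted(indices))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

So shuffling works the same way for a file-per-sample Dataset: only the order of file loads changes, never the files themselves.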