I am trying to create a model that understands patterns in the human voice, and I have a large number of voice samples (133K files, about 40 GB in total). I run a lot of preprocessing and then generate a feature cube which I want to feed to a PyTorch model.
So far I have been doing the preprocessing and cube generation offline: I create the feature cubes and write them to a *.pt file using torch.save().
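For reference, the offline step is conceptually something like this (a rough sketch, not my exact code; extract_features, label_of, and wav_files are illustrative placeholders for my actual pipeline):

    import torch

    # Offline step: compute one feature cube per file, stack everything into
    # a single tensor, and serialize it as a whole. extract_features, label_of
    # and wav_files are placeholders for my real preprocessing pipeline.
    features = torch.stack([extract_features(f) for f in wav_files])
    labels = torch.tensor([label_of(f) for f in wav_files])

    torch.save(features, "features.pt")
    torch.save(labels, "labels.pt")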
So far I have only used 5K samples, which produced a .pt file of about 1 GB, so I would expect the full 133K samples to produce a file of roughly 26-27 GB (1 GB × 133/5 ≈ 26.6 GB).
On the training host I then call torch.load(), loading everything into memory, and wrap the tensors in a TensorDataset and DataLoader:
    features = torch.load(featpath)
    labels = torch.load(labpath)
    dataset = torch.utils.data.TensorDataset(features, labels)
    return torch.utils.data.DataLoader(dataset, shuffle=True,
                                       batch_size=batch_size,
                                       num_workers=num_workers)
This worked fine with fewer samples, and it would probably keep working if I provision enough memory and disk space (I use S3 and SageMaker).
I am just wondering: am I doing the right thing? Are there best practices for large datasets? Should I stream from disk, or does everything have to be loaded into memory? I am assuming Dataset/DataLoader take care of this.
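To make the question concrete, by "streaming from disk" I mean something like a map-style Dataset that lazily loads one precomputed sample at a time, e.g. if I saved one small .pt file per sample instead of one giant cube (a sketch under that assumption; the file layout and names are hypothetical, not what I currently have):

    import torch
    from torch.utils.data import Dataset, DataLoader

    class LazyVoiceDataset(Dataset):
        # Loads one precomputed (feature, label) pair from disk on demand,
        # so only the current batch needs to fit in memory. Assumes one
        # .pt file per sample -- a hypothetical layout, not my current one.
        def __init__(self, sample_paths):
            self.sample_paths = sample_paths  # list of per-sample .pt files

        def __len__(self):
            return len(self.sample_paths)

        def __getitem__(self, idx):
            feature, label = torch.load(self.sample_paths[idx])
            return feature, label

    # DataLoader workers would then read samples from disk in parallel:
    # loader = DataLoader(LazyVoiceDataset(paths), batch_size=64,
    #                     shuffle=True, num_workers=4)

Is something like that the recommended approach here, or is loading everything into a TensorDataset fine as long as it fits in memory?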