I am trying to create a model that understands patterns in the human voice, and I have a lot of voice samples (133K files, about 40 GB overall). I run a lot of preprocessing and then generate a feature cube that I want to feed to a PyTorch model.
So far I have been doing the preprocessing and cube generation offline: I create the feature cubes and write them to a *.pt file using torch.save().
I have currently only used 5K samples, which generated a *.pt file of about 1 GB (so I would expect roughly 26 GB for the full dataset).
I then call torch.load() on the training host, loading everything into memory with TensorDataset and DataLoader:
```python
features = torch.load(featpath)
labels = torch.load(labpath)
dataset = torch.utils.data.TensorDataset(features, labels)
```
It worked OK with fewer samples, and it will probably keep working if I provide enough memory / disk space (I use S3 and SageMaker).
I am just wondering: am I doing the right thing? Are there best practices for large datasets? Streaming from disk? Or do they all have to be loaded into memory? I am assuming here that Dataset/DataLoader handle this.
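For context, the in-memory pipeline described above can be sketched end-to-end. The tensor shapes here are made up stand-ins for the real feature cubes (the originals would come from torch.load(featpath) and torch.load(labpath)):

```python
import torch

# stand-in tensors: 5K samples, each a hypothetical 40x100 feature cube
features = torch.randn(5000, 40, 100)
labels = torch.randint(0, 10, (5000,))

# everything lives in memory; TensorDataset just indexes into the tensors
dataset = torch.utils.data.TensorDataset(features, labels)
loader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True)

batch_feats, batch_labels = next(iter(loader))
print(batch_feats.shape)  # torch.Size([64, 40, 100])
```

Note that this holds the full feature tensor in RAM for the lifetime of the dataset, which is exactly the constraint the question is about.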
As long as you have enough memory to do so, there's no problem with what you are doing. If the data were to get too large, the easiest solution would be to split the samples into individual files and load them lazily in a custom Dataset.
Thanks, just to clarify:
Will __getitem__ have to flip from one file to the other when one has finished its cycle? This would mean that for every epoch I would be flipping from one file to another, and back to the first at the beginning of the next epoch?
It would be nice if there were an example, but this seems reasonable in case I run out of memory. I still do not have a good handle on Datasets.
Thank you again.
Well, to avoid unnecessary pain, you would split the data so that each sample tensor has its own .pt file. Then, in __getitem__, you call torch.load, like this:
```python
import os
import torch

class VoiceDataset(torch.utils.data.Dataset):
    def __init__(self, root):
        self.root = root
        self.files = os.listdir(root)  # take all files in the root directory

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # each file holds a (sample, label) tuple for one sample
        sample, label = torch.load(os.path.join(self.root, self.files[idx]))
        return sample, label
```
And then use this dataset in the same way. Keeping the data in several chunks rather than as individual samples would be … problematic.
Thanks, hopefully it will fit in memory, but this seems a pretty simple solution.
Would there be any issue with shuffling and the DataLoader? I wonder how shuffling would work in this case.
With shuffling enabled, the DataLoader randomizes the idx argument passed to __getitem__, effectively choosing a random file each time.
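This is easy to see with a toy Dataset (purely illustrative) whose __getitem__ simply returns the index it was given — the shuffled DataLoader then yields a fresh permutation of the indices each epoch:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class IndexDataset(Dataset):
    # toy dataset: "loading" a sample just returns its index
    def __len__(self):
        return 10

    def __getitem__(self, idx):
        return idx

loader = DataLoader(IndexDataset(), batch_size=10, shuffle=True)
indices = next(iter(loader)).tolist()
print(indices)          # a random permutation of 0..9, different each epoch
print(sorted(indices))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

So shuffling works the same way for a file-per-sample Dataset: only the order of file loads changes, never the files themselves.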