Speed up training when lazy-loading a lot of data

Hi everyone,

Here is my question:

I have roughly 400,000 training samples, each stored as a CSV file (~35 GB in total). I have a custom dataset object that reads these CSV files in __getitem__. Currently, each epoch takes roughly 70 minutes with a batch size of 512.

So, I was wondering if there’s any way to speed up training without adding additional resources?

Thanks!

You should consider using torch.utils.data.DataLoader and specifying the number of workers. These workers fetch samples from the dataset in parallel and can significantly improve read speed. Here is a little snippet:

dataloader = torch.utils.data.DataLoader(
    dataset,
    batch_size=512,
    shuffle=True,
    num_workers=10,
)

Thanks for the suggestion. I tried it out, but whenever I set num_workers to anything greater than 1, the VM just freezes (running on a GCP instance with 1 GPU). Could this be a memory problem?

Well, it seems there is an open issue for this: CPU memory gradually leaks when num_workers > 0 in the DataLoader. You can find a diverse set of possible workarounds in the aforementioned thread.

Converting the data to a binary file format should also help (e.g. read each CSV with pandas, then write the parsed data with numpy.save, numpy.memmap, or torch.save). Make sure to write 4-byte floats (float32), unless you actually train with float64.