Speed up training when lazy-loading a lot of data

Hi everyone,

Here is my question:

I have roughly 400,000 training samples, each stored as a CSV file (~35 GB in total). I have a custom dataset object that reads these CSV files in __getitem__. Currently, each epoch takes roughly 70 minutes with a batch size of 512.

So, I was wondering if there’s any way to speed up training without adding additional resources?

Thanks!

You should consider using torch.utils.data.DataLoader and specifying the number of workers. These workers fetch samples from the dataset in parallel and can significantly improve read speed. Here is a little snippet:

dataloader = torch.utils.data.DataLoader(
    dataset,
    batch_size=512,
    shuffle=True,
    num_workers=10,
)

Thanks for the suggestion. I tried it out, but whenever I set num_workers to anything greater than 1, the VM just freezes (running on a GCP instance with 1 GPU). Could this be a memory problem?

Well, it seems there is an open issue for this: CPU memory gradually leaks when num_workers > 0 in the DataLoader. You can find a diverse set of possible workarounds in the aforementioned thread.

Converting the data to a binary file format should also help (e.g. read each CSV with pandas, then write the parsed data with numpy.save, numpy.memmap, or torch.save). Make sure to write 4-byte floats (float32), unless you actually train with float64.