This seems like an easy problem, but I have not found a working solution yet.
I have a very large time-series dataset which, if put into a single dataloader, would far exceed the GPU's memory.
My general idea is a double for loop: the outer loop iterates over the DataFrame, takes a slice of it, and transforms that slice into a dataloader, which is then passed to an inner loop that runs through a set of epochs (with a third loop for the batches), training and validating one and the same model.
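To make the intended structure concrete, here is a minimal sketch of the chunked outer loop (the `make_dataloader` helper, `df`, and the training calls are placeholders for my setup, not real API calls):

```python
def iter_chunks(n_rows, chunk_size):
    """Yield (start, stop) row ranges that together cover the DataFrame."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)

# Outer loop: one chunk of the DataFrame at a time.
# for start, stop in iter_chunks(len(df), chunk_size=100_000):
#     chunk = df.iloc[start:stop]
#     loader = make_dataloader(chunk)       # hypothetical helper
#     for epoch in range(n_epochs):         # middle loop: epochs
#         for batch in loader:              # inner loop: batches
#             ...train/validate the same model...
```

The open question is how to fill in the training part of this sketch with the Lightning API.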
If this works, the next question is how to train the model. First I thought of calling trainer.fit() multiple times, but that does not seem to work. Then I thought about writing a custom optimization loop in the classical PyTorch style. However, the model I am using comes from PyTorch Forecasting and runs on PyTorch Lightning, so I need to use the Lightning API. How do I do that?
Can I solve this by writing a custom training_step(batch, batch_idx, dataloader_idx)? If so, how do I implement it correctly so that it runs over multiple dataloaders?
FYI: my DataFrame is so large that manually splitting it into dataloader_i does not work; the for loop over the DataFrame is mandatory.