Loading a big dataset (bigger than memory) using PyTorch

I have a dataset that is about three times as large as my system's RAM, and I need to train some deep learning models on it using PyTorch. Could you please advise how I can use torch data loaders (or an alternative) in this scenario?

Assume my data is stored as subfolders inside a parent directory, as shown below.

Transaction_Data/
              ---Customer1/
                          --- day1
                          --- day2
                          .
                          --- dayN
              ---Customer2/
                          --- day1
                          --- day2
                          .
                          --- dayN
              ---CustomerN/
                          --- day1
                          --- day2
                          .
                          --- dayN

Let's assume the data are clean and that each customer corresponds to an individual data frame (the days represent rows).

I want to load the data in batches (probably 5 customers at once, which can fit into memory) and train a DL model using torch. What is an efficient way to load them?

I need to iterate over the data and run DL models on it. Should I be using a custom dataset? Any pointers would be really appreciated.

Thank you very much in advance!

I don’t think you need to do anything special for this scenario, as the amount of memory required is usually a small multiple of the batch_size (due to prefetching). For example, ImageNet spans hundreds of gigabytes (even more when decoded to uncompressed RGB), yet only a small fraction of that (host and device) memory is needed during training.
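
To make that concrete, here is roughly all the DataLoader-side configuration involved (a minimal sketch; `dataset` is a placeholder for whatever Dataset you end up building):

```python
from torch.utils.data import DataLoader

# With num_workers=4 and prefetch_factor=2, at most
# num_workers * prefetch_factor = 8 batches are buffered in host
# memory at any time, regardless of how large the dataset is on disk.
loader = DataLoader(
    dataset,            # placeholder: any map-style Dataset
    batch_size=32,
    shuffle=True,
    num_workers=4,
    prefetch_factor=2,  # batches prefetched per worker (the default)
)
```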

@eqy Thank you for your response!
I understand that only a small amount of memory is needed to hold those batches, but how do I load the data in the first place, given the above folder structure?

It is difficult to understand your data organization without knowing what day1 and day2 mean. Are they different classes? Different datasets? If they are different classes, then DatasetFolder or ImageFolder is exactly the abstraction for this scenario: torchvision.datasets — Torchvision 0.8.1 documentation (pytorch.org)
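
For reference, the class-per-folder case looks like this (purely illustrative; it assumes image files organized one class per subfolder, which may not match your layout):

```python
from torchvision import datasets, transforms

# ImageFolder treats each immediate subfolder of `root` as one class
# and yields (image_tensor, class_index) pairs.
dataset = datasets.ImageFolder(
    root="Transaction_Data",  # only valid if subfolders were class labels
    transform=transforms.ToTensor(),
)
```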

No, they are not different classes. I hope the sample structure below explains it better.

Transaction_Data/
              ---Customer1/
                          --- day1
                          --- day2
                          .
                          --- dayN
              ---Customer2/
                          --- day1
                          --- day2
                          .
                          --- dayN
              ---CustomerN/
                          --- day1
                          --- day2
                          .
                          --- dayN

You can start by taking a look at the default dataset classes (torch.utils.data — PyTorch 1.8.1 documentation) and seeing whether your data fits the map-style or the iterable-style abstraction. The map style is usually the more straightforward abstraction for many datasets, as you only need to define a __getitem__ and a __len__ function. Once you have a usable dataset, a DataLoader (torch.utils.data.dataloader — PyTorch 1.8.1 documentation) will handle the parallelization and in-memory loading for you.
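
A map-style dataset for your layout might look like the following (a minimal sketch under stated assumptions: the per-day files are CSVs that pandas can read, and the class name CustomerDataset is a hypothetical placeholder; adapt the reading logic to your actual file format):

```python
import os

import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader


class CustomerDataset(Dataset):
    """One sample per customer, loaded lazily from disk on access."""

    def __init__(self, root="Transaction_Data"):
        # Index the customer directories up front; no file contents
        # are read here, so construction stays cheap.
        self.customer_dirs = sorted(
            os.path.join(root, d)
            for d in os.listdir(root)
            if os.path.isdir(os.path.join(root, d))
        )

    def __len__(self):
        return len(self.customer_dirs)

    def __getitem__(self, idx):
        # Only this one customer's day files are read into memory here,
        # so resident RAM is bounded by batch_size (plus worker
        # prefetching), not by the total dataset size on disk.
        customer_dir = self.customer_dirs[idx]
        day_files = sorted(os.listdir(customer_dir))
        frames = [
            pd.read_csv(os.path.join(customer_dir, f)) for f in day_files
        ]
        df = pd.concat(frames, ignore_index=True)
        return torch.tensor(df.values, dtype=torch.float32)


# batch_size=5 matches the "5 customers at once" memory budget above.
loader = DataLoader(CustomerDataset(), batch_size=5, num_workers=2)

for batch in loader:
    ...  # forward/backward pass here
```

One caveat: the default collate function stacks samples into a single tensor, so if customers have different numbers of days you will need to pass a custom collate_fn (or pad the samples) before batching this way.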

@eqy Sure, I will check those. Thank you very much!