Lazy loading of wide dataset

Kasper_Rasmussen · January 11, 2022, 3:40pm

Hi PyTorch community,

I am training a model on a very wide dataset (~500,000 features). To read the data from disk I use dask to load an xarray.core.dataarray.DataArray object, so that not all of the data is loaded into memory at once. I can load subsets of the data into memory as a numpy array like this: xarray[0:64,:].values. This loads 64 samples into memory in about 2 seconds. I then want to feed the data to the model one batch at a time with a batch size of 64, which should give an estimated epoch time of about 3 minutes given the total sample size (~6000). My problem arises when I try to implement this with the PyTorch Dataset class. I want a DataLoader that loads the data from disk into memory one batch at a time with a specified batch size. To do that I wrote the following Dataset class and wrapped it in a DataLoader.


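from torch.utils.data import Dataset, DataLoader

class load_wide(Dataset):

    def __init__(self, xarray, labels):
        # Lazily loaded (dask-backed) DataArray and the matching labels
        self.xarray = xarray
        self.labels = labels

    def __getitem__(self, item):
        # Reads a single row from disk; called once per sample in a batch
        data = self.xarray[item,:].values
        labels = self.labels[item]

        return data, labels

    def __len__(self):
        return len(self.xarray)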

I then load the data like this:

train_ds = load_wide(xarray_train, labels_train)
train_dl = DataLoader(train_ds, batch_size=64, shuffle=True)

for i, data in enumerate(train_dl):
    feats, labels = data
    preds = net(feats)
    loss = criterion(preds, labels)

This works fine, but it takes about 90 seconds to load a batch, resulting in unreasonable training time. What went wrong in my implementation of data loading that causes it to load the data so slowly? Using the DataLoader causes a 45-fold slowdown compared with slicing the array directly (90 seconds versus 2 seconds per batch). Can someone explain the cause of this?
Also, how can I simultaneously evaluate the model on a validation set that is also loaded one batch at a time?

ptrblck · January 13, 2022, 6:57am

I’m not familiar with the internal implementation of xarray, but what seems to be different is the shuffling: with shuffle=True the DataLoader requests 64 random rows per batch, while your manual test reads a contiguous block. Could you test your simple code snippet (xarray[0:64,:].values) with 64 random indices instead of contiguous ones and compare the loading speed?
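
A minimal sketch of that comparison might look like the following. The names time_batch_read and data are hypothetical, and data is a small in-memory numpy array standing in for the dask-backed DataArray, so the timings it prints only become meaningful once the real on-disk object is swapped in.

```python
import time

import numpy as np


def time_batch_read(array, indices):
    """Return (seconds, batch) for materializing the rows selected by `indices`."""
    start = time.perf_counter()
    # np.asarray forces the selected rows into memory as a numpy array.
    batch = np.asarray(array[indices, :])
    return time.perf_counter() - start, batch


# Hypothetical stand-in for the real on-disk array (assumption: 6000 samples).
data = np.arange(6000 * 50, dtype=np.float32).reshape(6000, 50)

rng = np.random.default_rng(0)
contiguous = slice(0, 64)  # what xarray[0:64, :] reads
scattered = np.sort(rng.choice(data.shape[0], size=64, replace=False))  # what shuffling reads

t_contig, batch_contig = time_batch_read(data, contiguous)
t_random, batch_random = time_batch_read(data, scattered)

print(f"contiguous rows: {t_contig:.6f} s")
print(f"random rows:     {t_random:.6f} s")
```

If the random read is far slower than the contiguous one on the real data, the access pattern rather than the DataLoader itself is the likely bottleneck.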

Kasper_Rasmussen · January 13, 2022, 12:21pm

This was the exact cause of the issue. It seems that xarray is not a good fit for PyTorch’s DataLoader class when the rows are accessed in random order. I will look for alternative ways of loading the data. Thank you!
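
One possible alternative, sketched here as an assumption rather than something tested in this thread, is to keep every read contiguous and shuffle only the order of the batches. The helper contiguous_batch_slices is a hypothetical name. Note that this gives block-level rather than per-sample shuffling, which may or may not be acceptable for a given training setup.

```python
import numpy as np


def contiguous_batch_slices(n_samples, batch_size, seed=None):
    """Yield slices covering [0, n_samples) in contiguous blocks, in shuffled order."""
    starts = np.arange(0, n_samples, batch_size)
    rng = np.random.default_rng(seed)
    rng.shuffle(starts)  # shuffle which block comes next, not the rows inside a block
    for start in starts:
        stop = min(int(start) + batch_size, n_samples)
        yield slice(int(start), stop)


# Example with the sizes mentioned above: ~6000 samples, batch size 64.
slices = list(contiguous_batch_slices(6000, 64, seed=0))
print(len(slices))  # number of batches per epoch
```

Each slice could then be used as xarray[sl, :].values together with labels[sl], so that every batch is a single contiguous read like the 2-second one measured in the first post.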

JinroTorch · March 29, 2024, 1:43am

Hello! Please refer to this, maybe this could help you as it helped me.