How to use a dataset larger than memory?

I have a dataset consisting of one large CSV file, larger than memory, with 150 million records.

Should I split this into smaller files and treat each file’s length as the batch size?

All the examples I’ve seen in tutorials refer to images, i.e. one file per example, or, if using a CSV, they load the entire file into memory first.

The examples of custom Dataset classes I’ve seen look roughly like the snippet below (with __init__ loading the whole file up front): __len__ returns the length of the entire file, and __getitem__ returns an individual record.

I’d have thought files larger than memory would be a common issue in this era of big data?

class CsvDataset(Dataset):
    def __init__(self, csv_path):
        # the usual tutorial pattern: read the whole file into memory up front
        data = pd.read_csv(csv_path)
        self.input = torch.as_tensor(data.iloc[:, :-1].values, dtype=torch.float32)
        self.labels = torch.as_tensor(data.iloc[:, -1].values)
        self.data_len = len(data)

    def __len__(self):
        return self.data_len

    def __getitem__(self, index):
        return self.input[index], self.labels[index]

Any advice would be greatly appreciated.

Thanks


You could use pd.read_csv with the chunksize argument, so that you will only read smaller chunks of your data. Have a look at this documentation for an example.
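For reference, a minimal sketch of chunked reading with pandas (the file path and the assumption that the last column holds the label are placeholders):

import pandas as pd

# read the CSV 10,000 rows at a time instead of all at once
for chunk in pd.read_csv('data/mydata.csv', chunksize=10_000):
    features = chunk.iloc[:, :-1].values  # every column except the last
    labels = chunk.iloc[:, -1].values     # assumed label column
    # ...train on this chunk; it can then be garbage collected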


Can torch.utils.data.DataLoader be used to read only one batch of data into memory at a time?

DataLoader: combines a dataset and a sampler, and provides single- or multi-process iterators over the dataset.

If your Dataset loads the data lazily in __getitem__, each worker in your DataLoader will load one complete batch into memory.

I don’t understand how the Dataset loads the data lazily in __getitem__.
Does it mean that the batch size is 1?
Thanks in advance for your explanation.

If your Dataset's __getitem__ looks like this:

def __getitem__(self, index):
    x = Image.open(self.paths[index])  # load lazily
    return x

then each call to __getitem__ will load a single sample using the index.
The DataLoader takes care of loading enough samples to create a whole batch, as specified by batch_size.
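As a rough usage sketch (dataset stands in for any such lazily loading Dataset, and it is assumed __getitem__ returns tensors, e.g. via transforms.ToTensor()):

from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=2)

for batch in loader:
    # each iteration triggers batch_size calls to __getitem__, so only a few
    # batches (plus worker prefetch) are held in memory at any time
    print(batch.shape)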


Thanks for your detailed explanation.
I see what you mean.

Thanks. I’ve made some progress, but my setup doesn’t seem to be iterating properly at the moment: the code never reaches my print lines inside the enumerate loop.

Here’s my code:

import torch
from torch.utils import data
import pandas as pd

class MyDataset(data.Dataset):
    def __init__(self, csv_path, chunkSize):
        self.chunksize = chunkSize
        self.reader = pd.read_csv(csv_path, sep=',', chunksize=self.chunksize, header=None, iterator=True)
    
    def __len__(self):
        return self.chunksize

    def __getitem__(self, index):
        # note: `index` is unused; each call just reads the next sequential chunk
        data = self.reader.get_chunk(self.chunksize)
        tensorData = torch.as_tensor(data.values, dtype=torch.float32)
        inputs = tensorData[:, :-1]
        labels = tensorData[:, 99]
        return inputs, labels 

def main():
    batch_size = 100

    kwargs = {}
    custom_data_from_csv = MyDataset('data/mydata.txt', batch_size)
    train_loader = data.DataLoader(dataset=custom_data_from_csv, batch_size=batch_size, shuffle=True, **kwargs)
            
    # enumerate yields (index, batch), so the batch tuple must be unpacked separately
    for batchIdx, (inputData, target) in enumerate(train_loader):
        print(inputData)
        print(target)

if __name__ == '__main__':
    main()

Also, I wasn’t sure whether __len__ was meant to be the size of each chunk or the total number of chunks in the entire file.


The new ChunkDataset API might help you!

The way it works is through hierarchical sampling: the dataset is split into chunks (sets of examples), which are shuffled, and each chunk has its samples shuffled too (a second layer of shuffling). The C++ or Python DataLoader then retrieves batches from an internal buffer that holds just a few chunks, not the whole corpus.

In order to use it, all you need to do is implement your own C++ ChunkDataReader, which parses a single chunk of data. This chunk reader is then passed to ChunkDataset, which handles all the shuffling and chunk-by-chunk loading for you.

Look at a test example at: DataLoaderTest::ChunkDataSetGetBatch

Currently there is only C++ support, but Python bindings are on the way (https://github.com/pytorch/pytorch/pull/21232) and any feedback is welcome.
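For a pure-Python approximation of the chunk-buffer idea, here is a sketch using torch.utils.data.IterableDataset; this is not the ChunkDataset API itself, the file path and column layout are assumptions, and the shuffling of chunk order is omitted for brevity:

import pandas as pd
import torch
from torch.utils.data import DataLoader, IterableDataset

class ShuffledChunkCsvDataset(IterableDataset):
    """Streams a large CSV chunk by chunk, shuffling the rows inside each chunk."""

    def __init__(self, csv_path, chunksize=10_000):
        self.csv_path = csv_path
        self.chunksize = chunksize

    def __iter__(self):
        # only one chunk of `chunksize` rows is held in memory at a time
        for chunk in pd.read_csv(self.csv_path, chunksize=self.chunksize, header=None):
            values = torch.as_tensor(chunk.values, dtype=torch.float32)
            for i in torch.randperm(len(values)):    # intra-chunk shuffle
                yield values[i, :-1], values[i, -1]  # features, label

loader = DataLoader(ShuffledChunkCsvDataset('data/mydata.txt'), batch_size=100)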


Hi,

Thanks for your explanation, but I am still a little confused. If we load the images lazily by index, as you mentioned, then even if the whole dataset requires much more memory than the machine has (let’s say we have 20,000+ 3D images), PyTorch will only use, in each iteration, batch_size * one image’s memory * number of workers <= total memory?

I also encountered this situation, where I have more than 10,000 MRI images to train on. Below is my customized Dataset class with its __getitem__, but my job terminated without any error.

import os

import pandas as pd
import torch
from torch.utils.data import Dataset


class MRIDataset(Dataset):

    def __init__(self, caps_directory, tsv_file, transformations=None):
        """
        Args:
            caps_directory (string): Directory of all the images.
            tsv_file (string): File name of the train/test split file.
            transformations (callable, optional): Optional transformations to be applied on a sample.
        """
        self.caps_directory = caps_directory
        self.transformations = transformations

        # Check the format of the tsv file here
        self.df = pd.read_csv(tsv_file, sep='\t')
        if ('age' not in list(self.df.columns.values)) or ('session_id' not in list(self.df.columns.values)) or \
           ('participant_id' not in list(self.df.columns.values)):
            raise Exception("The data file is not in the correct format. "
                            "Columns should include ['participant_id', 'session_id', 'age']")
        self.participant_list = list(self.df['participant_id'])
        self.session_list = list(self.df['session_id'])
        self.age_list = list(self.df['age'])

    def __len__(self):
        return len(self.participant_list)

    def __getitem__(self, idx):
        img_name = self.participant_list[idx]
        sess_name = self.session_list[idx]
        age = self.age_list[idx]

        image_path = os.path.join(self.caps_directory, 'subjects', img_name, sess_name, 't1', 'dl',
                                  img_name + '_' + sess_name + '_skull-removing_reg-linear.nii.gz')

        pt_path = os.path.join(self.caps_directory, 'subjects', img_name, sess_name, 't1', 'dl',
                               img_name + '_' + sess_name + '_skull-removing_reg-linear.pt')

        # to accelerate image loading on the CPU, we convert the nifti image to .pt format once
        if not os.path.exists(pt_path):
            save_as_pt(image_path)  # user-defined helper (not shown)

        image = torch.load(pt_path)

        # replace any NaN values in the data with 0
        if torch.isnan(image).any():
            image[torch.isnan(image)] = 0

        if self.transformations:
            image = self.transformations(image)

        sample = {'image_id': img_name + '_' + sess_name, 'image': image, 'age': age}

        return sample

I do not know whether this is caused by the memory limitation or not.

Thanks in advance

Hao

Hi, did you find any solution to your problem? I am stuck with a similar problem, where I have a numpy array larger than my memory. If I split the data per instance/item, the disk access for every single item makes the DataLoader very slow.

Hello, does your method work with an NMT dataset?

I have an NMT dataset of 199 MB for training and 22.3 MB for the dev set; the batch size is 256 and the max length of each sentence is 50 words. The data is loaded into GPU RAM without any problems, but when I start training I get an out-of-memory error.

from torchtext.data import Field, TabularDataset, BucketIterator  # legacy torchtext API

SRC = Field(tokenize=normalizeString, init_token='<sos>', eos_token='<eos>', fix_length=50, batch_first=True)
TRG = Field(tokenize=normalizeString, init_token='<sos>', eos_token='<eos>', fix_length=50, batch_first=True)

train_data, valid_data = TabularDataset.splits(path='./data/', train='SCUT_train.csv',
                                               validation='SCUT_.csv', format='csv',
                                               fields=[('src', SRC), ('trg', TRG)], skip_header=True)

SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)

BATCH_SIZE = 128

train_iterator, valid_iterator = BucketIterator.splits((train_data, valid_data), sort_key=lambda x: len(x.src),
                                                       batch_size=BATCH_SIZE, device=device)

My dataset is small (212.3 MB) and I already can’t use the whole training set; what happens if I use 1 GB of data or more?

Hi,

Following the discussion on memory management: my original dataset fits into memory just fine, but before I can provide my samples for training I have to do some pre-processing, including up-sampling. If I up-sample all the entries in my dataset, it no longer fits in memory, so by using a custom Dataset and DataLoader I was hoping this would be done per batch only, avoiding exceeding the memory limit.

I have a code like this:

class NickiDataset(Dataset):

    def __init__(self, X, y, transform=None, target_transform=None):
        self.target = y
        self.Xdata = X
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.target)

    def __getitem__(self, idx):
        data_i = self.Xdata[idx, :, :]
        target_i = self.target[idx, :]
        if self.transform:
            data_i = self.transform(data_i)
        if self.target_transform:
            target_i = self.target_transform(target_i)
        return data_i, target_i


class TrnsDt(object):
    """Up-sample the raw ndarray sample and featurize it with a pre-trained model."""

    def __call__(self, sample):
        global count
        if count == 0:
            print(sample.shape)

        # up-sample the signal (samplerate library)
        rsmpl = samplerate.resample(sample, 160, 'sinc_best')

        pt_sample = torch.from_numpy(np.transpose(rsmpl))
        if count == 0:
            print(pt_sample.shape)

        # prc_Wv and mdl_wv are a pre-trained processor and model defined elsewhere
        input_values = prc_Wv(pt_sample, return_tensors="pt", padding="longest", sampling_rate=16e3).input_values
        out = mdl_wv(torch.squeeze(input_values))
        out_proc = out[0].clone().detach()
        if count == 0:
            print(out_proc.shape)

        return out_proc


class TrnsDtTrgt(object):
    """Convert ndarrays in sample to Tensors."""

    def __call__(self, sample):
        pt_sample = torch.from_numpy(sample)
        return pt_sample

train_ds = NickiDataset(train_dt_in, train_trgt,
                        transform=TrnsDt(), target_transform=TrnsDtTrgt())

batch_size = 16
train_dataloader = DataLoader(dataset=train_ds, batch_size=batch_size, shuffle=True)

num_epoch = 1
total_train_sample = len(train_ds)
n_iterations = np.ceil(total_train_sample / batch_size)

global count
count = 0

for epoch in range(num_epoch):
    for inx, (dt_in, trgt) in enumerate(train_dataloader):
        if count == 0:
            print("input tensor shape: ", dt_in.shape)
        if (inx + 1) % 5 == 0:
            print(f'epoch {epoch}/{num_epoch}, step {inx+1}/{n_iterations}, input {dt_in.shape}')

Any suggestions as to why this still runs into memory trouble?

Best,

I assume “trouble in memory management” means you are running out of host RAM?
If so, make sure the entire (small) dataset fits into RAM together with a batch of the larger up-sampled samples (intermediate tensors could also be stored). If you are using multiple workers, note that each worker copies the Dataset and thus also the preloaded small dataset.
If you are using the CPU for the actual training, note that the model training state (parameters, forward activations, gradients, etc.) also needs to be stored in RAM.

Hi ptrblck,

What if my raw data (60000 x 4000 x 10) is in a single .pt file? 60000 is the number of training samples and 4000 x 10 is the input dimension. I implemented lazy loading as shown below, but the RAM cannot handle it even if my batch_size is very small. Can you help me with this?

import torch
from torch.utils.data import Dataset


class MyDataset(Dataset):  # create dataset class

    def __init__(self, feature_dir, label_dir):
        self.feature_dir = feature_dir
        self.label_dir = label_dir
        self.labels = torch.load(label_dir)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, index):
        # torch.load reads the entire 60000 x 4000 x 10 tensor from disk
        # before it is indexed, on every single call
        X_data = torch.load(self.feature_dir).float()[index]
        y_label = self.labels.float()[index]
        return X_data, y_label

In your code snippet you are loading the entire feature tensor in every __getitem__ call (torch.load reads the full file before you index it), which explains the high memory usage. You could try to store the tensor as a numpy array and use mmap to load chunks of the data.
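A minimal sketch of that approach (the file names and the one-time conversion step are assumptions based on the dimensions above):

import numpy as np
import torch
from torch.utils.data import Dataset

# one-time conversion, on a machine with enough RAM:
# features = torch.load('features.pt')
# np.save('features.npy', features.numpy())

class MmapDataset(Dataset):
    def __init__(self, feature_path, label_path):
        # mmap_mode='r' maps the file instead of reading it into RAM
        self.features = np.load(feature_path, mmap_mode='r')
        self.labels = torch.load(label_path)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, index):
        # only the requested 4000 x 10 slice is read from disk here
        x = torch.from_numpy(np.array(self.features[index])).float()
        y = self.labels[index].float()
        return x, y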

I am confused: would loading the data in __getitem__ load the whole dataset into memory?

Hi ptrblck,

Thanks for your reply! I’ll try this method.

Hi,

I find this thread very interesting.

  1. I too am searching for a way to lazily load data in chunks or batches from one large CSV file (the file is too large to fit into the memory of the particular device).

  2. Moreover, I am also searching for a way to randomly split this data into X_train, X_valid, X_test, y_train, y_valid, y_test for training, validation, and testing, respectively.

2.1 However, the requirement of having validation data is optional; I would already be happy with a split into just training and test data.

The problem is a classification task with 1 target column and c target classes.

My current, vague idea is to somehow use Dataset and DataLoader, perhaps with pre-computed indices, splitting chunks or batches on the fly, or whatever solution might be available.

During my little journey I came across torch.utils.data.random_split, Subset and ChunkDataset. However, so far I could not figure out how to fulfill the above requirements in a pure PyTorch way.

As of today (maybe this is still in the works), is there a straightforward way to do the above (1. with 2., and optionally 2.1) purely with PyTorch, i.e. without integrating pandas, self-written C++ code, or something like that?

Any further ideas, and of course a comprehensive answer or a hint towards the most modern solution, would be greatly appreciated.

Thanks

You could try to save the CSV data as a numpy memmap; this works quite well for language models. Another great alternative is to save it on disk as e.g. a chunked zarr array (or use tensorstore). That should be really fast to read with multiple processes, since in practice it’s just a bunch of different files.
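A rough sketch of the zarr variant, with a pre-computed-indices split for the train/valid/test requirement above (the zarr package, array shape, file names, and split fractions are assumptions, not part of the original answer):

import pandas as pd
import torch
import zarr
from torch.utils.data import DataLoader, Dataset, Subset

# one-time conversion: stream the CSV into a chunked on-disk array
n_rows, n_cols = 150_000_000, 100  # assumed to be known in advance
store = zarr.open('data.zarr', mode='w', shape=(n_rows, n_cols),
                  chunks=(10_000, n_cols), dtype='f4')
offset = 0
for chunk in pd.read_csv('data/mydata.csv', chunksize=10_000, header=None):
    store[offset:offset + len(chunk)] = chunk.values.astype('f4')
    offset += len(chunk)

class ZarrDataset(Dataset):
    def __init__(self, path):
        self.arr = zarr.open(path, mode='r')  # lazy: chunks are read on access

    def __len__(self):
        return self.arr.shape[0]

    def __getitem__(self, index):
        row = torch.from_numpy(self.arr[index])
        return row[:-1], row[-1]  # features, label (last column assumed)

# random train/valid/test split via pre-computed indices
ds = ZarrDataset('data.zarr')
perm = torch.randperm(len(ds)).tolist()
n_hold = len(ds) // 10
test_ds = Subset(ds, perm[:n_hold])
valid_ds = Subset(ds, perm[n_hold:2 * n_hold])
train_ds = Subset(ds, perm[2 * n_hold:])

train_loader = DataLoader(train_ds, batch_size=100, shuffle=True, num_workers=4)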