How to use a dataset larger than memory?

I have a dataset consisting of one large CSV file, larger than memory, containing 150 million records.

Should I split this into smaller files and treat each file's length as the batch size?

All the examples I’ve seen in tutorials refer to images, i.e. one file per example, or, if using a CSV, they load the entire file into memory first.

The examples of custom Dataset classes I’ve seen look like the one below: __len__ returns the total number of records and __getitem__ returns an individual record.

I’d have thought files larger than memory would be a common issue in this time of big data?

def __init__(self, csv_path):
    # reads the entire CSV into memory up front
    ...

def __len__(self):
    return self.data_len

def __getitem__(self, index):
    return self.input[index], self.labels[index]

Any advice would be greatly appreciated.

thanks

You could use pd.read_csv with the chunksize argument, so that you will only read smaller chunks of your data. Have a look at this documentation for an example.
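Roughly something like this (a minimal sketch; the file name and the column layout are just placeholders):

import pandas as pd

# read the large CSV in chunks of 10,000 rows instead of all at once
for chunk in pd.read_csv('data/mydata.csv', chunksize=10000):
    inputs = chunk.iloc[:, :-1].values   # all columns except the last
    labels = chunk.iloc[:, -1].values    # last column as the target
    # ... convert to tensors and run your training step on this chunk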

Can torch.utils.data.DataLoader be used to read batch size of data to memory ?

DataLoader. Combines a dataset and a sampler, and provides single- or multi-process iterators over the dataset.

If your Dataset loads the data lazily in __getitem__, each worker in your DataLoader will load one complete batch into memory.

I don’t understand how the Dataset loads the data lazily in __getitem__.
Does it mean that the batch size is 1?
Thanks in advance for your explanation.

If your Dataset's __getitem__ looks like this:

def __getitem__(self, index):
    x = Image.open(self.paths[index])  # load lazily
    return x

then each call to __getitem__ will load a single sample using the index.
The DataLoader takes care of loading enough samples to create a whole batch, as specified by batch_size.
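Putting it together, a lazily loading Dataset could look roughly like this (a sketch; the file names and labels below are placeholders, not real data):

import torch
from torch.utils.data import Dataset, DataLoader
from torchvision.transforms.functional import to_tensor
from PIL import Image

class LazyImageDataset(Dataset):
    def __init__(self, paths, labels):
        self.paths = paths      # only the file paths are kept in memory
        self.labels = labels

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        img = Image.open(self.paths[index]).convert('RGB')  # load one sample lazily
        return to_tensor(img), self.labels[index]

paths = ['img_0.png', 'img_1.png']   # placeholder file names
labels = torch.tensor([0, 1])
loader = DataLoader(LazyImageDataset(paths, labels), batch_size=32, shuffle=True)
# each iteration of `loader` loads and stacks only batch_size samples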

Thanks for your detailed explanation.
I see what you mean.

Thanks. I’ve made some progress, but my setup doesn’t seem to be iterating properly at the moment: the code never hits my print lines inside the enumerate loop.

here’s my code:

import torch
from torch.utils import data
import pandas as pd

class MyDataset(data.Dataset):
    def __init__(self, csv_path, chunkSize):
        self.chunksize = chunkSize
        self.reader = pd.read_csv(csv_path, sep=',', chunksize=self.chunksize, header=None, iterator=True)
    
    def __len__(self):
        return self.chunksize

    def __getitem__(self, index):
        data = self.reader.get_chunk(self.chunksize)
        tensorData = torch.as_tensor(data.values, dtype=torch.float32)
        inputs = tensorData[:, :-1]
        labels = tensorData[:, 99]
        return inputs, labels 

def main():
    batch_size = 100

    kwargs = {}
    custom_data_from_csv = MyDataset('data/mydata.txt', batch_size)
    train_loader = data.DataLoader(dataset=custom_data_from_csv, batch_size=batch_size, shuffle=True, **kwargs)
            
    for inputData, target in enumerate(train_loader):
        print(inputData)
        print(target)

if __name__ == '__main__':
    main()

Also, I wasn’t sure if __len__ was meant to be the size of each chunk or the total number of chunks in the entire file.

The new ChunkDataset API might help you!

The way it works is through hierarchical sampling: the dataset is split into chunks (sets of examples), which are shuffled, and each chunk has its samples shuffled too (a second layer of shuffling). The C++ or Python DataLoader will retrieve batches from an internal buffer that holds just a few chunks, not the whole corpus.

In order to use it, all you need to do is implement your own C++ ChunkDataReader, which parses a single chunk of data. This chunk data reader is then passed to ChunkDataset, which handles all the shuffling and chunk-by-chunk loading for you.
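Conceptually, the two-level shuffling looks roughly like this in Python (a simplified sketch that holds one chunk at a time; read_chunk is a hypothetical function that parses chunk i from disk, not part of the actual API):

import random

def chunked_batches(read_chunk, num_chunks, batch_size):
    """read_chunk(i) parses chunk i from disk and returns a list of samples."""
    chunk_order = list(range(num_chunks))
    random.shuffle(chunk_order)            # first level: shuffle the chunks
    for i in chunk_order:
        samples = read_chunk(i)            # only one chunk is held in memory
        random.shuffle(samples)            # second level: shuffle within the chunk
        for start in range(0, len(samples), batch_size):
            yield samples[start:start + batch_size]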

Look at a test example at: DataLoaderTest::ChunkDataSetGetBatch

Currently there is only C++ support, but Python bindings are on the way (https://github.com/pytorch/pytorch/pull/21232) and any feedback is welcome.

Hi,

Thanks for your explanation, but I am still a little confused. I think that if we load each image lazily by index, as you mentioned, then even if the whole dataset requires much more memory than the machine has (let’s say we have 20000+ 3D images), PyTorch will only take, in each iteration, batch_size * one image’s memory * number of batches, which should stay below the total memory?

I also encountered this situation, where I have more than 10000 MRI images to train on. Here is the __getitem__ of my customized Dataset class, but my job terminated without any error.

class MRIDataset(Dataset):

    def __init__(self, caps_directory, tsv_file, transformations=None):
        """
        Args:
            caps_directory (string): Directory of all the images.
            tsv_file (string): File name of the train/test split file.
            transformations (callable, optional): Optional transformations to be applied on a sample.
        """
        self.caps_directory = caps_directory
        self.transformations = transformations

        # Check the format of the tsv file here
        self.df = pd.read_csv(tsv_file, sep='\t')
        if ('age' not in list(self.df.columns.values)) or ('session_id' not in list(self.df.columns.values)) or \
           ('participant_id' not in list(self.df.columns.values)):
            raise Exception("the data file is not in the correct format. "
                            "Columns should include ['participant_id', 'session_id', 'diagnosis']")
        self.participant_list = list(self.df['participant_id'])
        self.session_list = list(self.df['session_id'])
        self.age_list = list(self.df['age'])

    def __len__(self):
        return len(self.participant_list)

    def __getitem__(self, idx):
        img_name = self.participant_list[idx]
        sess_name = self.session_list[idx]
        age = self.age_list[idx]

        image_path = os.path.join(self.caps_directory, 'subjects', img_name, sess_name, 't1', 'dl',
                                  img_name + '_' + sess_name + '_skull-removing_reg-linear.nii.gz')

        pt_path = os.path.join(self.caps_directory, 'subjects', img_name, sess_name, 't1', 'dl',
                               img_name + '_' + sess_name + '_skull-removing_reg-linear.pt')

        # to accelerate CPU loading, we convert the nifti image to .pt format
        if not os.path.exists(pt_path):
            save_as_pt(image_path)

        image = torch.load(pt_path)

        # check if the data has NaN values
        if torch.isnan(image).any():
            image[torch.isnan(image)] = 0

        if self.transformations:
            image = self.transformations(image)

        sample = {'image_id': img_name + '_' + sess_name, 'image': image, 'age': age}

        return sample

I do not know if this is caused by the memory limitation or not.

Thanks in advance

Hao

Hi, could you find any solution to your problem? I am stuck with a similar problem, where I have a numpy array larger than my memory. If I split the data per instance/item, then accessing the disk for every item makes the DataLoader very slow.

Hello, sir, does your method work with an NMT dataset?

I have an NMT dataset of 199 MB for training and 22.3 MB for the dev set; the batch size is 256 and the max length of each sentence is 50 words. The data is loaded to GPU RAM without any problems, but when I start training I get an out-of-memory error.

SRC = Field(tokenize= normalizeString, init_token='<sos>', eos_token='<eos>', fix_length = 50, batch_first=True)
TRG = Field(tokenize= normalizeString, init_token='<sos>', eos_token='<eos>', fix_length = 50, batch_first=True) 

train_data, valid_data = TabularDataset.splits(path='./data/', train='SCUT_train.csv',
    validation='SCUT_.csv', format='csv',
    fields=[('src', SRC), ('trg', TRG)], skip_header=True)

SRC.build_vocab(train_data, min_freq = 2) 
TRG.build_vocab(train_data, min_freq = 2)

BATCH_SIZE = 128

train_iterator, vali_iterator = BucketIterator.splits((train_data, valid_data), sort_key=lambda x: len(x.src),
     batch_size = BATCH_SIZE, device = device)

My dataset is small (212.3 MB) and I still can’t use the whole training set; what happens if I use 1 GB of data or more?

Hi,

Following the discussion on memory management: my original dataset fits into memory just fine, but before I am able to provide my samples for training I have to do some pre-processing, including up-sampling. If I up-sample all the entities in my dataset, it can no longer be contained in memory. So, using a custom Dataset and DataLoader, I was hoping that this would be done within a batch only and avoid surpassing the memory limit.

I have code like this:

class NickiDataset(Dataset):

    def __init__(self, X, y, transform=None, target_transform=None):
        self.target = y
        self.Xdata = X
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.target)

    def __getitem__(self, idx):
        data_i = self.Xdata[idx, :, :]
        target_i = self.target[idx, :]

        if self.transform:
            data_i = self.transform(data_i)
        if self.target_transform:
            target_i = self.target_transform(target_i)

        return data_i, target_i

class TrnsDt(object):
    """Convert ndarrays in sample to Tensors."""

    def __call__(self, sample):
        global count
        if count == 0:
            print(sample.shape)

        rsmpl = samplerate.resample(sample, 160, 'sinc_best')
        pt_sample = torch.from_numpy(np.transpose(rsmpl))
        if count == 0:
            print(pt_sample.shape)

        # to_net = pt_sample.tolist()
        # prc_Wv, mdl_wv
        input_values = prc_Wv(pt_sample, return_tensors="pt", padding="longest", sampling_rate=16e3).input_values
        out = mdl_wv(torch.squeeze(input_values))
        out_proc = out[0].clone().detach()
        if count == 0:
            print(out_proc.shape)

        return out_proc


class TrnsDtTrgt(object):
    """Convert ndarrays in sample to Tensors."""

    def __call__(self, sample):
        pt_sample = torch.from_numpy(sample)
        return pt_sample

train_ds = NickiDataset(train_dt_in, train_trgt,
                        transform=TrnsDt(), target_transform=TrnsDtTrgt())

batch_size = 16
train_dataloader = DataLoader(dataset=train_ds, batch_size=batch_size, shuffle=True)

num_epoch = 1
total_train_sample = len(train_ds)
n_iterations = np.ceil(total_train_sample / batch_size)

global count
count = 0

for epoch in range(num_epoch):
    for inx, (dt_in, trgt) in enumerate(train_dataloader):
        if count == 0:
            print("input tensor shape: ", dt_in.shape)

        if (inx + 1) % 5 == 0:
            print(f'epoch {epoch}/{num_epoch}, step {inx+1}/{n_iterations}, input {dt_in.shape}')

Any suggestion why this still causes trouble in memory management?

Best,

I assume “trouble in memory management” means you are running out of host RAM?
If so, make sure that loading the entire (small) dataset fits into RAM, as well as a batch using the larger images (intermediates could also be stored). In case you are using multiple workers, note that each worker will copy the Dataset and thus also the preloaded small dataset.
If you are using the CPU for the actual training, note that the model training itself (parameters, forward activations, gradients, etc.) also needs to be stored in RAM.
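As a very rough back-of-the-envelope sketch (all numbers below are made-up placeholders, not measurements), the peak host RAM could be estimated like this:

# rough estimate of peak host RAM usage (all numbers are placeholders)
preloaded_gb   = 4.0   # the "small" dataset kept inside the Dataset (self.Xdata, self.target)
num_workers    = 4     # each DataLoader worker process holds its own copy of the Dataset
batch_gb       = 0.5   # one up-sampled batch plus intermediates produced by the transform
model_train_gb = 2.0   # parameters, forward activations, gradients, etc. (CPU training)

total_gb = preloaded_gb * (num_workers + 1) + batch_gb * num_workers + model_train_gb
print(f"estimated peak host RAM: {total_gb:.1f} GB")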