Huge Data, issue with DataLoader

Hi,

I am trying to create a DataLoader for a rather huge dataset. The dataset consists of a bunch of files, each with two fields per line, i.e. each line is two tab-separated strings (the strings are of fixed length). I have created a master file that lists some metadata on each of the dataset files: the file path, the number of entries in the file, and a cumulative line count over the files. Here is a gist of my dataset object:

The master file is used to convert an index value to a particular line in a particular file inside __getitem__. The total number of records (lines across all files) is rather large, ~18 billion (about 1.1 TB).
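
A simplified sketch of that kind of dataset (hypothetical class and field names, not the actual code from the gist) could look something like this:

import bisect
import linecache
from torch.utils.data import Dataset

class InteractionDataSketch(Dataset):
    # Maps a global line index onto (file, line-within-file) via the master file.
    def __init__(self, master_file):
        self.paths = []        # path of each data file
        self.cumulative = []   # cumulative line count up to and including each file
        with open(master_file) as f:
            # assuming the master file is tab separated: path<TAB>count<TAB>cumulative
            for row in f:
                path, count, cumulative = row.rstrip('\n').split('\t')
                self.paths.append(path)
                self.cumulative.append(int(cumulative))

    def __len__(self):
        # total number of lines across all files (~18 billion here)
        return self.cumulative[-1]

    def __getitem__(self, index):
        # first file whose cumulative count exceeds the global index
        file_idx = bisect.bisect_right(self.cumulative, index)
        offset = index if file_idx == 0 else index - self.cumulative[file_idx - 1]
        # linecache is 1-based; fine for illustration, not tuned for TB-scale files
        line = linecache.getline(self.paths[file_idx], offset + 1)
        left, right = line.rstrip('\n').split('\t')
        return left, right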

I have two issues: 1) When wrapping the Dataset object in a DataLoader it takes a really long time, and I am not sure what DataLoader is actually doing. 2) It then produces an error:
train_data = InteractionData('positiveSampleInfoFile-train.txt', 'propertyFile.tsv', 25, 0.5)

=======
IndexError                                Traceback (most recent call last)
in ()
----> 1 for _ in train_data:
      2     break

/lustre/atlas1/bif102/proj-shared/piet/miniconda-rhea/lib/python2.7/site-packages/torch/utils/data/dataloader.pyc in __next__(self)
    185     def __next__(self):
    186         if self.num_workers == 0:  # same-process loading
--> 187             indices = next(self.sample_iter)  # may raise StopIteration
    188             batch = self.collate_fn([self.dataset[i] for i in indices])
    189             if self.pin_memory:

/lustre/atlas1/bif102/proj-shared/piet/miniconda-rhea/lib/python2.7/site-packages/torch/utils/data/sampler.pyc in __iter__(self)
    117     def __iter__(self):
    118         batch = []
--> 119         for idx in self.sampler:
    120             batch.append(idx)
    121             if len(batch) == self.batch_size:

/lustre/atlas1/bif102/proj-shared/piet/miniconda-rhea/lib/python2.7/site-packages/torch/utils/data/sampler.pyc in __iter__(self)
     48
     49     def __iter__(self):
---> 50         return iter(torch.randperm(len(self.data_source)).long())
     51
     52     def __len__(self):

/lustre/atlas1/bif102/proj-shared/piet/miniconda-rhea/lib/python2.7/site-packages/torch/tensor.pyc in __iter__(self)
    156     def __iter__(self):
    157         if self.nelement() > 0:
--> 158             return iter(map(lambda i: self.select(0, i), _range(self.size(0))))
    159         else:
    160             return iter([])

IndexError: list assignment index out of range

=========

What am I doing wrong? Any advice would be much appreciated.

The error looks like you’re indexing into a list(?) and the index is out of range.

In general I’d advise you to start small and not try to load all that data: maybe try reducing the size of the files you’re using so you can iterate on this dataset design faster?

Hi Richard, I will test a small case. From my initial investigation the error occurs before the dataset object’s __getitem__ is even called, while the iterator is being set up. I littered __getitem__ with print statements and none of them triggered. If I limit the return value of __len__ to the size of one file it works; it also works for some arbitrary higher value such as 50 million. So I am wondering if there is some intrinsic limit?
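
From the traceback it looks like the default random sampler materializes torch.randperm(len(self.data_source)) up front, which for ~18 billion indices is itself enormous. In case that is the culprit, here is a rough, untested sketch of a custom sampler that draws indices lazily (with replacement) instead:

import random
from torch.utils.data.sampler import Sampler

class LazyRandomSampler(Sampler):
    # Hypothetical sampler: yields random indices one at a time instead of
    # building a full permutation of the whole dataset in memory.
    def __init__(self, data_source, num_samples=None):
        self.data_source = data_source
        self.num_samples = num_samples if num_samples is not None else len(data_source)

    def __iter__(self):
        n = len(self.data_source)
        drawn = 0
        # while loop rather than range() so Python 2 never tries to build a huge list
        while drawn < self.num_samples:
            yield random.randrange(n)
            drawn += 1

    def __len__(self):
        return self.num_samples

It would be passed to the DataLoader through its sampler argument (with shuffle left off).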

Python 2.7 has a separation between its int and long types. Because you’re using such large numbers, it’s possible that you’re running into this, but right now I’m not sure how it would come into play.
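
For example, in Python 2:

import sys
print(type(sys.maxint))      # <type 'int'>
print(type(sys.maxint + 1))  # <type 'long'> (automatically promoted)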

In Python 3 these are all the same; you could also try testing the same code with Python 3.