Huge Data, issue with DataLoader

Hi,

I am trying to create a DataLoader for a rather huge dataset. The dataset consists of a bunch of files, each with two fields per line, i.e. each line is two tab-separated strings (the strings are of fixed length). I have created a master file that lists some metadata on each of the dataset files: the file path, the number of entries in the file, and a cumulative line count over the files. Here is a gist of my dataset object:

The master file is used to convert an index value to a particular line in a particular file inside __getitem__. The total number of records (lines across all files) is rather large, ~18 billion (about 1.1 TB).
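
A simplified sketch of that kind of dataset (hypothetical class and field names, not the actual code from the gist) could look something like this:

import bisect
import linecache
from torch.utils.data import Dataset

class InteractionDataSketch(Dataset):
    # Maps a global line index onto (file, line-within-file) via the master file.
    def __init__(self, master_file):
        self.paths = []        # path of each data file
        self.cumulative = []   # cumulative line count up to and including each file
        with open(master_file) as f:
            # assuming the master file is tab separated: path<TAB>count<TAB>cumulative
            for row in f:
                path, count, cumulative = row.rstrip('\n').split('\t')
                self.paths.append(path)
                self.cumulative.append(int(cumulative))

    def __len__(self):
        # total number of lines across all files (~18 billion here)
        return self.cumulative[-1]

    def __getitem__(self, index):
        # first file whose cumulative count exceeds the global index
        file_idx = bisect.bisect_right(self.cumulative, index)
        offset = index if file_idx == 0 else index - self.cumulative[file_idx - 1]
        # linecache is 1-based; fine for illustration, not tuned for TB-scale files
        line = linecache.getline(self.paths[file_idx], offset + 1)
        left, right = line.rstrip('\n').split('\t')
        return left, right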

I have two issues: 1) When wrapping the Dataset object in a DataLoader it takes a really long time, and I am not sure what DataLoader is actually doing. 2) It then produces an error:
train_data = InteractionData('positiveSampleInfoFile-train.txt', 'propertyFile.tsv', 25, 0.5)

=======
IndexError                                Traceback (most recent call last)
in ()
----> 1 for _ in train_data:
      2     break

/lustre/atlas1/bif102/proj-shared/piet/miniconda-rhea/lib/python2.7/site-packages/torch/utils/data/dataloader.pyc in __next__(self)
    185     def __next__(self):
    186         if self.num_workers == 0:  # same-process loading
--> 187             indices = next(self.sample_iter)  # may raise StopIteration
    188             batch = self.collate_fn([self.dataset[i] for i in indices])
    189             if self.pin_memory:

/lustre/atlas1/bif102/proj-shared/piet/miniconda-rhea/lib/python2.7/site-packages/torch/utils/data/sampler.pyc in __iter__(self)
    117     def __iter__(self):
    118         batch = []
--> 119         for idx in self.sampler:
    120             batch.append(idx)
    121             if len(batch) == self.batch_size:

/lustre/atlas1/bif102/proj-shared/piet/miniconda-rhea/lib/python2.7/site-packages/torch/utils/data/sampler.pyc in __iter__(self)
     48
     49     def __iter__(self):
---> 50         return iter(torch.randperm(len(self.data_source)).long())
     51
     52     def __len__(self):

/lustre/atlas1/bif102/proj-shared/piet/miniconda-rhea/lib/python2.7/site-packages/torch/tensor.pyc in __iter__(self)
    156     def __iter__(self):
    157         if self.nelement() > 0:
--> 158             return iter(map(lambda i: self.select(0, i), _range(self.size(0))))
    159         else:
    160             return iter([])

IndexError: list assignment index out of range

=========

What am I doing wrong? Any advice would be much appreciated.

The error looks like you’re indexing into a list(?) and the index is out of range.

In general I’d advise you to start small and not try to load all that data: maybe try reducing the size of the files you’re using so you can iterate on this dataset design faster?

Hi Richard, I will test a small case. From my initial investigation the error occurs before the dataset object’s __getitem__ is even called, while the iterator is being set up. I littered __getitem__ with print statements and none of them triggered. If I limit the return value of __len__ to the size of one file it works; it also works for some arbitrary higher value such as 50 million. So I am wondering if there is some intrinsic limit?
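
From the traceback it looks like the default random sampler materializes torch.randperm(len(self.data_source)) up front, which for ~18 billion indices is itself enormous. In case that is the culprit, here is a rough, untested sketch of a custom sampler that draws indices lazily (with replacement) instead:

import random
from torch.utils.data.sampler import Sampler

class LazyRandomSampler(Sampler):
    # Hypothetical sampler: yields random indices one at a time instead of
    # building a full permutation of the whole dataset in memory.
    def __init__(self, data_source, num_samples=None):
        self.data_source = data_source
        self.num_samples = num_samples if num_samples is not None else len(data_source)

    def __iter__(self):
        n = len(self.data_source)
        drawn = 0
        # while loop rather than range() so Python 2 never tries to build a huge list
        while drawn < self.num_samples:
            yield random.randrange(n)
            drawn += 1

    def __len__(self):
        return self.num_samples

It would be passed to the DataLoader through its sampler argument (with shuffle left off).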

Python 2.7 has a separation between its int and long types. Because you’re using such large numbers, it’s possible that you’re running into this, but right now I’m not sure how it would come into play.
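
For example, in Python 2:

import sys
print(type(sys.maxint))      # <type 'int'>
print(type(sys.maxint + 1))  # <type 'long'> (automatically promoted)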

In Python 3 these are all the same; you could also try testing the same code with Python 3.