Hi,
I am trying to create a dataloader for a rather huge dataset. The dataset consists of a bunch of files; each line of each file holds two fixed-length strings separated by a tab. I have created a master file that lists some metadata on each of the dataset files: the file path, the number of entries in that file, and a cumulative line count over the files. Here is a gist of my dataset object:
The master file is used to convert an index value to a particular line in a particular file inside the `__getitem__` method. The total number of records (lines across all files) is rather large: ~18 billion (about 1.1 TB).
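To make the lookup concrete, here is a minimal sketch of the index-to-line mapping described above. It is a simplified illustration and not the code from the gist: the names `locate` and `read_record` are made up, and it assumes that the cumulative count stored for each file includes that file's own lines and that every line has the same byte length (newline included).

```python
from bisect import bisect_right


def locate(index, cumulative_counts):
    """Map a global record index to (file_number, line_number_within_file).

    cumulative_counts[k] is assumed to be the total number of lines in
    files 0..k inclusive (a hypothetical layout of the master file).
    """
    if index < 0 or index >= cumulative_counts[-1]:
        raise IndexError("record index out of range")
    # First file whose cumulative count is strictly greater than index.
    file_no = bisect_right(cumulative_counts, index)
    lines_before = cumulative_counts[file_no - 1] if file_no > 0 else 0
    return file_no, index - lines_before


def read_record(path, line_no, line_bytes):
    """Read one fixed-length line by seeking straight to its byte offset.

    line_bytes is the assumed length of a full line, newline included.
    """
    with open(path, "rb") as handle:
        handle.seek(line_no * line_bytes)
        raw = handle.read(line_bytes)
    left, right = raw.decode("ascii").rstrip("\n").split("\t")
    return left, right


# Three files holding 3, 5 and 2 lines respectively.
cumulative = [3, 8, 10]
print(locate(0, cumulative))  # (0, 0)
print(locate(3, cumulative))  # (1, 0)
print(locate(9, cumulative))  # (2, 1)
```

Because the lines are fixed length, the seek avoids scanning a file from the start, so each lookup costs one binary search over the master file entries plus one small read.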
I have two issues: 1) Wrapping the Dataset object in a DataLoader takes a really long time, and I am not sure what DataLoader is actually doing. 2) It then produces an error:
train_data = InteractionData('positiveSampleInfoFile-train.txt', 'propertyFile.tsv', 25, 0.5)
=======
IndexError Traceback (most recent call last)
in ()
----> 1 for _ in train_data:
2 break
/lustre/atlas1/bif102/proj-shared/piet/miniconda-rhea/lib/python2.7/site-packages/torch/utils/data/dataloader.pyc in __next__(self)
185 def __next__(self):
186 if self.num_workers == 0: # same-process loading
--> 187 indices = next(self.sample_iter) # may raise StopIteration
188 batch = self.collate_fn([self.dataset[i] for i in indices])
189 if self.pin_memory:
/lustre/atlas1/bif102/proj-shared/piet/miniconda-rhea/lib/python2.7/site-packages/torch/utils/data/sampler.pyc in __iter__(self)
117 def __iter__(self):
118 batch = []
--> 119 for idx in self.sampler:
120 batch.append(idx)
121 if len(batch) == self.batch_size:
/lustre/atlas1/bif102/proj-shared/piet/miniconda-rhea/lib/python2.7/site-packages/torch/utils/data/sampler.pyc in __iter__(self)
48
49 def __iter__(self):
---> 50 return iter(torch.randperm(len(self.data_source)).long())
51
52 def __len__(self):
/lustre/atlas1/bif102/proj-shared/piet/miniconda-rhea/lib/python2.7/site-packages/torch/tensor.pyc in __iter__(self)
156 def __iter__(self):
157 if self.nelement() > 0:
--> 158 return iter(map(lambda i: self.select(0, i), _range(self.size(0))))
159 else:
160 return iter([])
IndexError: list assignment index out of range
=========
What am I doing wrong? Any advice would be much appreciated.