DataLoaders - Multiple files, and multiple rows per file with lazy evaluation

Hi All,

I’m trying to create a Dataset class that can load many large files, where each file holds rows of data a model would need to train on.

I’ve read: Loading huge data functionality

import os
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self):
        self.data_files = os.listdir('data_dir')
        self.data_files.sort()

    def __getitem__(self, idx):
        # Load a single file only when its index is requested.
        return load_file(self.data_files[idx])

    def __len__(self):
        return len(self.data_files)


dset = MyDataset()
loader = DataLoader(dset, num_workers=8)

Essentially, if each file is an image, then you can use this approach to load only the required images into memory.
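For context, load_file isn’t defined in that snippet. A minimal sketch of what it might look like for the image case, assuming PIL and torchvision (both of which are my assumptions, not part of the original code):

import os

from PIL import Image
from torchvision import transforms

_to_tensor = transforms.ToTensor()

def load_file(name):
    # Hypothetical loader; os.listdir returns bare filenames, so join with
    # the same 'data_dir' used in MyDataset above.
    with Image.open(os.path.join('data_dir', name)) as img:
        return _to_tensor(img.convert("RGB"))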

So what about a file with multiple lines? Then I came across this github thread: https://github.com/pytorch/text/issues/130

import csv
import linecache
from torch.utils.data import Dataset

class LazyTextDataset(Dataset):
    def __init__(self, filename):
        self._filename = filename
        self._total_data = 0
        with open(filename, "r") as f:
            self._total_data = len(f.readlines()) - 1

    def __getitem__(self, idx):
        # linecache fetches a single line by number instead of reading the whole file.
        line = linecache.getline(self._filename, idx + 1)
        csv_line = csv.reader([line])
        ...  # parse the csv line here (see the complete example further down)

    def __len__(self):
        return self._total_data

Now the problem I have is: how do I combine these two approaches so that I can load multiple files that are all large?

Since __getitem__ only takes a single idx, is there a way to index across multiple files and each of their lines?

Can someone point me in the right direction?


torch.utils.data.ConcatDataset may be helpful:

http://pytorch.org/docs/0.3.0/data.html#torch.utils.data.ConcatDataset


I was looking at ConcatDataset too, but one of my questions is: does it support shuffling between datasets?

Let’s say I have two datasets, A and B.

Can data be shuffled between A and B? From the code (I don’t understand it 100%), it seems that data are only shuffled within their own dataset.

Yes.

Shuffling is performed by the DataLoader; since ConcatDataset exposes a single combined index range, the shuffled indices mix samples from all of the constituent datasets.
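A minimal sketch to illustrate, with two toy TensorDatasets standing in for A and B:

import torch
from torch.utils.data import TensorDataset, ConcatDataset, DataLoader

A = TensorDataset(torch.zeros(4, 1))  # stand-in for dataset A
B = TensorDataset(torch.ones(4, 1))   # stand-in for dataset B

# shuffle=True samples indices over the full concatenated range 0..7,
# so a single batch can contain rows from both A and B.
loader = DataLoader(ConcatDataset([A, B]), batch_size=4, shuffle=True)
for (batch,) in loader:
    print(batch.squeeze())  # zeros and ones appear mixed together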

Gotcha, I’ll try to use ConcatDataset.

Thanks, I really appreciate your help!

So, does this mean I need to create a Dataset for every file in the directory? I have about 118 million training rows split across 3000 files.

Assuming there are N rows in each file (each of which is now a Dataset), will ConcatDataset read the individual Datasets into memory?


I created one dataset for each file, and with only 3000 files it isn’t that much to hold them in a list (the ConcatDataset just keeps a reference to each one).

If you write your Dataset with linecache, it won’t read each file into memory.

At least this is my observation after reading more files than my computer’s memory can support.

import csv
import linecache
import subprocess
from torch.utils.data import Dataset

class LazyTextDataset(Dataset):
    def __init__(self, filename):
        self._filename = filename
        # Count lines with `wc -l` so the whole file is never read into memory.
        self._total_data = int(subprocess.check_output(
            "wc -l " + filename, shell=True).split()[0]) - 1

    def __getitem__(self, idx):
        # linecache fetches a single line from the file by line number.
        line = linecache.getline(self._filename, idx + 1)
        csv_line = csv.reader([line])
        return next(csv_line)

    def __len__(self):
        return self._total_data

import os
from torch.utils.data import ConcatDataset

path = "/where_csv_files_are_dumped/"
files = [path + f for f in os.listdir(path) if f.endswith("csv")]
datasets = [LazyTextDataset(f) for f in files]
dataset = ConcatDataset(datasets)

This should work for multiple csv files where each row represents a training example.
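If it helps, here is a sketch of how the concatenated dataset would then be consumed. Each sample is a list of string fields from one csv row, so in practice you would convert to tensors inside __getitem__ or pass a custom collate_fn:

from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)

for batch in loader:
    ...  # convert the string fields to tensors and train on the batch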


I am going to be working with many files that can’t all fit into memory at the same time.

Since the data is text that I would like to load quickly, I was hoping to use either torch.load/pickle files or pandas/parquet files.

However, it looks like ConcatDataset can’t work for my case, since there’s no way to load only specific lines (like with csv files and linecache.getline), so I’d have to load whole files at once.

So now I’m thinking I’ll just load a fraction of the files at a time, as many as will fit in memory, perhaps using PyTorch multiprocessing to load those several files at once, and then build a ConcatDataset for that chunk.

I’ll probably have to write additional shuffling code for shuffling among the chunks.
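Roughly what I have in mind, as a sketch (load_one_file is a hypothetical helper that would eagerly load a single torch.load/pickle/parquet file and wrap it in a Dataset):

import random
from torch.utils.data import ConcatDataset, DataLoader

def chunked_loaders(files, files_per_chunk, batch_size):
    # Shuffle at the file level so the composition of each chunk changes
    # between epochs; the DataLoader then shuffles within each chunk.
    files = files[:]
    random.shuffle(files)
    for i in range(0, len(files), files_per_chunk):
        chunk = files[i:i + files_per_chunk]
        datasets = [load_one_file(f) for f in chunk]  # hypothetical helper
        yield DataLoader(ConcatDataset(datasets), batch_size=batch_size,
                         shuffle=True, num_workers=4)

# per epoch:
# for loader in chunked_loaders(files, files_per_chunk=50, batch_size=32):
#     for batch in loader:
#         ...  # train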

Does this strategy sound like the best approach? Or are there additional PyTorch tools that can help me here?

Hey @Santosh-Gupta, if you’re looking for lazy loading for parquet files I’d recommend exploring https://github.com/vahidk/tfrecord or https://github.com/uber/petastorm.

  • TFRecord would require you to go from parquet to tfrecord.
  • Petastorm should work out of the box.
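
A rough sketch of the Petastorm route for plain Parquet files (I’m recalling the make_batch_reader / petastorm.pytorch.DataLoader API from memory, so treat the exact calls as an assumption):

from petastorm import make_batch_reader
from petastorm.pytorch import DataLoader

# The URL is a placeholder; file://, hdfs:// and s3:// schemes are supported.
with make_batch_reader("file:///where_parquet_files_are_dumped/") as reader:
    loader = DataLoader(reader, batch_size=32)
    for batch in loader:
        ...  # train on the batch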

Sorry, I’m new to PyTorch and I want to check whether my guess is right: each call to __getitem__ in LazyTextDataset generates one I/O operation, so if my training dataset has 40,000 samples, it will generate 40,000 I/O operations.
Isn’t that time-consuming when we have to read the training data?

I have been trying to use this method to go through my dataset of around one thousand csv files, each containing about 300k lines (one training sample per line).

This method leads to a slow buildup of CPU memory as I iterate through the batches until it crashes.

I have tried calling linecache.clearcache() at the end of each batch, but it isn’t freeing the memory.

Did you find a solution to this? It seems like exactly what I am trying to implement.