DataLoader for a partitioned dataset

I have a large dataframe dataset with float columns but also columns containing lists of ints. I've partitioned it into smaller files and tried a custom dataset whose __getitem__ receives [filename, index] and loads that file if it is not already loaded. But when I set num_workers > 1, the RAM used by the worker processes keeps growing significantly. What is an efficient way to construct a DataLoader that can read the files one by one?

Could you share a code snippet?

Here is the code of my custom sampler. In self.indices[i][j], i is the parquet file and j is a set of indices.

from random import shuffle
from torch.utils.data import BatchSampler, SubsetRandomSampler

def __iter__(self):
    all_batches = []
    for j, file in enumerate(self.datafiles):
        # Collect every batch for this file before moving to the next one.
        batches = []
        for inds in self.indices[j]:
            sampler = BatchSampler(SubsetRandomSampler(inds), batch_size=self.batch_size, drop_last=self.drop_last)
            batches.extend(sampler)
        # Shuffle and flush once per file (not once per index set),
        # otherwise earlier batches get duplicated in all_batches.
        shuffle(batches)
        all_batches.extend(batches)
    for batch in all_batches:
        yield batch
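
For context, a sampler like this is meant to be passed through the DataLoader's batch_sampler argument, so the loader yields exactly the (file, index) batches built above. A minimal wiring sketch, using the hypothetical names MyParquetDataset and FilewiseBatchSampler to stand in for the custom classes in this thread:

    from torch.utils.data import DataLoader

    # Hypothetical names standing in for the custom classes in this thread.
    dataset = MyParquetDataset(datafiles)
    batch_sampler = FilewiseBatchSampler(datafiles, indices, batch_size=256, drop_last=False)

    # batch_sampler= hands the sampler's ready-made batches to the workers;
    # batch_size, shuffle and sampler must not be set at the same time.
    loader = DataLoader(dataset, batch_sampler=batch_sampler, num_workers=2)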

Here is what I do in my custom dataset's __getitem__:

import gc
from typing import Tuple

import pandas as pd

def __getitem__(self, idx: Tuple[str, int]) -> Tuple:
    file, idx = idx
    if self.cur_file != file:
        # Drop the cached file before loading the next one
        # (assumes self.data was initialized in __init__).
        self.cur_file = file
        del self.data
        self.data = pd.read_parquet(self.cur_file)
        gc.collect()
    credit = self.data.iloc[idx].copy()
    return credit
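
One way to shrink the per-worker cache is to hold a single row group instead of the whole file. A sketch with pyarrow, assuming the parquet files were written with fixed-size row groups (except possibly the last) so that idx maps to a (row group, offset) pair; RowGroupCache is a hypothetical helper, not from this thread:

    import pyarrow.parquet as pq

    class RowGroupCache:
        """Caches one row group of one parquet file at a time."""

        def __init__(self):
            self.key = None      # (file, row_group) currently cached
            self.frame = None

        def get_row(self, file: str, idx: int):
            pf = pq.ParquetFile(file)
            # Map the global row index to (row group, offset within it),
            # assuming all row groups but the last have the same size.
            rg_rows = pf.metadata.row_group(0).num_rows
            rg, offset = divmod(idx, rg_rows)
            if self.key != (file, rg):
                self.key = (file, rg)
                self.frame = pf.read_row_group(rg).to_pandas()
            return self.frame.iloc[offset]

With this, each worker's resident memory scales with the row-group size rather than the full file size, while a file-contiguous sampler like the one above keeps the reads mostly sequential.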

I am curious about the size of each parquet file.
When you run multiple workers, each worker process holds its own copy of the currently loaded file, so the total RAM usage would be roughly num_workers * parquet_size (e.g., 4 workers each caching a 2 GB file is about 8 GB). To reduce the memory footprint, the simplest options are to reduce num_workers or to reduce the parquet file size.
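
To confirm this is what happens, a quick diagnostic (an assumed print added inside the __getitem__ above, using torch.utils.data.get_worker_info from core PyTorch) shows which worker loads which file:

    from torch.utils.data import get_worker_info

    # Inside __getitem__, just before pd.read_parquet(...):
    info = get_worker_info()
    worker_id = info.id if info is not None else 0  # None in the main process
    print(f"worker {worker_id} loading {file}")

With num_workers > 1 you will typically see every worker load every file, because the DataLoader dispatches consecutive batches to different workers round-robin; that is exactly why RAM grows towards num_workers * parquet_size.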

I agree, that is one way. But are there other solutions? Alternatives to a pandas DataFrame? A different structure for my custom dataset? I'm open to your ideas 🙂

IIUC, we currently have a solution in TorchData for streaming parquet files, but it requires you to run the pipeline in iterable-style using DataPipe.
See: data/dataframemaker.py at b6ade8f097bc9ac08460cd403034a35daff09cfa · pytorch/data · GitHub
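
A rough sketch of what that pipeline could look like, assuming torchdata and torcharrow are installed; load_parquet_as_df is the functional form of the ParquetDFLoader datapipe in the linked dataframemaker.py, and it streams each file row group by row group instead of materializing whole files:

    from torch.utils.data import DataLoader
    from torchdata.datapipes.iter import IterableWrapper

    datapipe = (
        IterableWrapper(datafiles)   # iterable of parquet file paths
        .shuffle()                   # reshuffle the file order each epoch
        .sharding_filter()           # split the stream across workers
        .load_parquet_as_df()        # yield one DataFrame per row group
    )

    # An IterDataPipe plugs into the ordinary DataLoader; batch_size=None
    # passes the per-row-group DataFrames through unchanged.
    loader = DataLoader(datapipe, batch_size=None, num_workers=2)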

cc: @nivek

Besides, I am curious about the credit object. Do you dereference it later in your training loop?

I do, yes. I take column values from it.
Thanks for mentioning TorchData, I'll try it later.