DataLoader for a partitioned dataset

I have a large dataframe dataset with float columns but also columns containing lists of ints. I've partitioned it into smaller files and tried a custom dataset whose __getitem__ receives [filename, index] and loads that file if it is not already loaded. But when I set num_workers > 1, the RAM used by the worker processes keeps growing significantly. What is an efficient way to construct a DataLoader that can read the files one by one?

Could you share a code snippet?

Here is the code of my custom sampler. In self.indices[i][j], i is the parquet file and j is a set of indices.

from random import shuffle
from torch.utils.data import BatchSampler, SubsetRandomSampler

def __iter__(self):
    all_batches = []
    for j, file in enumerate(self.datafiles):
        # Collect every batch for this file before moving to the next one.
        batches = []
        for inds in self.indices[j]:
            sampler = BatchSampler(SubsetRandomSampler(inds), batch_size=self.batch_size, drop_last=self.drop_last)
            batches.extend(sampler)
        # Shuffle and flush once per file (not once per index set),
        # otherwise earlier batches get duplicated in all_batches.
        shuffle(batches)
        all_batches.extend(batches)
    for batch in all_batches:
        yield batch
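
For context, a sampler like this is meant to be passed through the DataLoader's batch_sampler argument, so the loader yields exactly the (file, index) batches built above. A minimal wiring sketch, using the hypothetical names MyParquetDataset and FilewiseBatchSampler to stand in for the custom classes in this thread:

    from torch.utils.data import DataLoader

    # Hypothetical names standing in for the custom classes in this thread.
    dataset = MyParquetDataset(datafiles)
    batch_sampler = FilewiseBatchSampler(datafiles, indices, batch_size=256, drop_last=False)

    # batch_sampler= hands the sampler's ready-made batches to the workers;
    # batch_size, shuffle and sampler must not be set at the same time.
    loader = DataLoader(dataset, batch_sampler=batch_sampler, num_workers=2)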

Here is what I do in my custom dataset's __getitem__:

import gc
from typing import Tuple

import pandas as pd

def __getitem__(self, idx: Tuple[str, int]) -> Tuple:
    file, idx = idx
    if self.cur_file != file:
        # Drop the cached file before loading the next one
        # (assumes self.data was initialized in __init__).
        self.cur_file = file
        del self.data
        self.data = pd.read_parquet(self.cur_file)
        gc.collect()
    credit = self.data.iloc[idx].copy()
    return credit
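
One way to shrink the per-worker cache is to hold a single row group instead of the whole file. A sketch with pyarrow, assuming the parquet files were written with fixed-size row groups (except possibly the last) so that idx maps to a (row group, offset) pair; RowGroupCache is a hypothetical helper, not from this thread:

    import pyarrow.parquet as pq

    class RowGroupCache:
        """Caches one row group of one parquet file at a time."""

        def __init__(self):
            self.key = None      # (file, row_group) currently cached
            self.frame = None

        def get_row(self, file: str, idx: int):
            pf = pq.ParquetFile(file)
            # Map the global row index to (row group, offset within it),
            # assuming all row groups but the last have the same size.
            rg_rows = pf.metadata.row_group(0).num_rows
            rg, offset = divmod(idx, rg_rows)
            if self.key != (file, rg):
                self.key = (file, rg)
                self.frame = pf.read_row_group(rg).to_pandas()
            return self.frame.iloc[offset]

With this, each worker's resident memory scales with the row-group size rather than the full file size, while a file-contiguous sampler like the one above keeps the reads mostly sequential.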

I am curious about the size of each parquet file.
When you run multiple workers, each worker process holds its own copy of the currently loaded file, so the total RAM usage would be roughly num_workers * parquet_size (e.g., 4 workers each caching a 2 GB file is about 8 GB). To reduce the memory footprint, the simplest options are to reduce num_workers or to reduce the parquet file size.
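
To confirm this is what happens, a quick diagnostic (an assumed print added inside the __getitem__ above, using torch.utils.data.get_worker_info from core PyTorch) shows which worker loads which file:

    from torch.utils.data import get_worker_info

    # Inside __getitem__, just before pd.read_parquet(...):
    info = get_worker_info()
    worker_id = info.id if info is not None else 0  # None in the main process
    print(f"worker {worker_id} loading {file}")

With num_workers > 1 you will typically see every worker load every file, because the DataLoader dispatches consecutive batches to different workers round-robin; that is exactly why RAM grows towards num_workers * parquet_size.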

I agree, that is one way. But are there other solutions? Alternatives to a pandas DataFrame? A different structure for my custom dataset? I'm open to your ideas 🙂

IIUC, we currently have a solution in TorchData for streaming parquet files, but it requires you to run the pipeline in iterable-style using DataPipe.
See: data/dataframemaker.py at b6ade8f097bc9ac08460cd403034a35daff09cfa · pytorch/data · GitHub
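
A rough sketch of what that pipeline could look like, assuming torchdata and torcharrow are installed; load_parquet_as_df is the functional form of the ParquetDFLoader datapipe in the linked dataframemaker.py, and it streams each file row group by row group instead of materializing whole files:

    from torch.utils.data import DataLoader
    from torchdata.datapipes.iter import IterableWrapper

    datapipe = (
        IterableWrapper(datafiles)   # iterable of parquet file paths
        .shuffle()                   # reshuffle the file order each epoch
        .sharding_filter()           # split the stream across workers
        .load_parquet_as_df()        # yield one DataFrame per row group
    )

    # An IterDataPipe plugs into the ordinary DataLoader; batch_size=None
    # passes the per-row-group DataFrames through unchanged.
    loader = DataLoader(datapipe, batch_size=None, num_workers=2)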

cc: @nivek

Besides, I am curious about the credit object. Do you dereference it later in your training loop?

I do, yes. I take column values from it.
Thanks for mentioning TorchData, I'll try it later.