Recommended way to build datapipe grouped/batched by csv file

thomasgho · December 1, 2022, 5:21am

I'm trying to load protein data from many CSV files. Each CSV file corresponds to one protein and contains multiple rows (the number of rows varies between files). I’d like to build a datapipe that groups the data by protein, i.e. uneven batches, with each batch representing one protein.

The default CSV parser and the examples online yield one row at a time. Currently, to group rows by protein, I use the default CSV parser with return_path=True, then use Grouper to group rows that share a file path:

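import torchdata.datapipes as dp  # assumed import; the post does not show it

def build_datapipe(root_dir):
    # List every file under root_dir, then open each one in text mode.
    datapipe = dp.iter.FileLister(root_dir, recursive=True)
    datapipe = dp.iter.FileOpener(datapipe, mode='rt')
    # Skip the header line; return_path=True yields (path, row) pairs.
    datapipe = datapipe.parse_csv(skip_lines=1, return_path=True)
    datapipe = datapipe.map(preprocess)  # row-wise preprocessing (user-defined)
    # group_fn is user-defined; it should return the file path of an item.
    datapipe = datapipe.groupby(group_key_fn=group_fn)
    return datapipe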

Is there a recommended way to do this? Perhaps writing a custom csv parser that yields multiple rows at a time.

nivek · December 1, 2022, 7:03pm

I think writing a custom DataPipe that yields a list of rows from the same CSV makes sense. Or you can write a custom sequential grouper, similar to ParagraphAggregator.
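
For illustration, here is a minimal sketch of both ideas in plain Python, so it runs without torchdata. The function names `group_consecutive_rows` and `read_csv_as_group` are hypothetical, not part of any library. The first assumes (path, row) pairs arrive in file order, as they would from a parser that reads one file after another; the second assumes each file is small enough to read into memory at once.

```python
import csv
from itertools import groupby
from operator import itemgetter


def group_consecutive_rows(path_row_pairs):
    """Sequential grouper: yield (path, rows) once per run of pairs sharing a path.

    Only consecutive items are compared, so nothing is buffered beyond the
    rows of the file currently being read.
    """
    for path, pairs in groupby(path_row_pairs, key=itemgetter(0)):
        yield path, [row for _, row in pairs]


def read_csv_as_group(path, skip_lines=1):
    """Whole-file reader: return (path, rows) for one CSV file as a single item."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    # Drop the first skip_lines lines (e.g. the header).
    return path, rows[skip_lines:]


if __name__ == "__main__":
    pairs = [
        ("a.csv", ["1", "2"]),
        ("a.csv", ["3", "4"]),
        ("b.csv", ["5", "6"]),
    ]
    for path, rows in group_consecutive_rows(pairs):
        print(path, len(rows))
```

Since FileLister yields file paths, the whole-file reader could probably be applied with `.map(read_csv_as_group)` directly on its output, skipping FileOpener and parse_csv. The sequential grouper would need to be wrapped in a custom IterDataPipe subclass whose `__iter__` delegates to the generator; that wrapper is not shown here.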