I'm trying to load protein data from many CSV files. Each CSV file corresponds to one protein and contains multiple rows (the number of rows varies from file to file). I'd like to build a datapipe that groups the data by protein, i.e. uneven batches where each batch holds the rows of one protein.
The default CSV parser and the examples online yield one row at a time. Currently, to group the individual rows by protein, I use the default CSV parser with return_path=True, then use Grouper to group rows that share the same file path:
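import torchdata.datapipes as dp  # assumed import; `dp` is not defined in the original snippet

def build_datapipe(root_dir):
    datapipe = dp.iter.FileLister(root_dir, recursive=True)
    datapipe = dp.iter.FileOpener(datapipe, mode='rt')
    # return_path=True pairs every row with the path of the file it came from
    datapipe = datapipe.parse_csv(skip_lines=1, return_path=True)
    datapipe = datapipe.map(preprocess)  # row-wise preprocessing
    datapipe = datapipe.groupby(group_key_fn=group_fn)  # group_fn keys on the file path
    return datapipe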
Is there a recommended way to do this? Perhaps writing a custom csv parser that yields multiple rows at a time.
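One possible shape for such a parser, as a minimal sketch rather than a recommended torchdata recipe: read each opened file in full and return all of its rows as one item. It assumes every file fits in memory and that the upstream datapipe yields (path, stream) pairs, as FileOpener does. The name read_protein and the sample file contents are hypothetical.

```python
import csv
import io


def read_protein(path_and_stream, skip_lines=1):
    """Parse one opened CSV file into (path, rows): one group per protein."""
    path, stream = path_and_stream
    reader = csv.reader(stream)
    # Skip the header, mirroring skip_lines=1 in the original pipeline.
    for _ in range(skip_lines):
        next(reader, None)
    # Keep every remaining non-empty row of this file together.
    rows = [tuple(row) for row in reader if row]
    return path, rows


# Stand-in for a single (path, stream) item; the contents are made up.
example = ("protein_a.csv", io.StringIO("x,y\n1,2\n3,4\n5,6\n"))
path, rows = read_protein(example)
```

In the pipeline this would presumably replace the parse_csv and groupby steps, for example datapipe = datapipe.map(read_protein) right after FileOpener, with the row-wise preprocessing applied to each row inside the function. Since each file is already one group, no buffering across files is needed.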