Tabular dataset for multiple files

abhigenie92 · May 5, 2020, 11:23am

Is there a way to use TabularDataset (https://torchtext.readthedocs.io/en/latest/data.html#tabulardataset) to load multiple CSV files? Can I use a wildcard pattern in the path?

Say the train data is located in multiple files.


dataset = TabularDataset(path="..", format="csv",
                         skip_header=True, fields=csv_datafields)

tom · May 5, 2020, 11:56am

There isn’t, but you can grab the TabularDataset code (which isn’t long and is almost exclusively about loading the file) and wrap its with block in a for loop over your files. If you initialize examples as an empty list before the loop and use examples += ... after reading each file, you’re good to go.
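
A minimal sketch of that loop, using only the standard library so it runs without torchtext. The names read_examples and make_example are hypothetical and not part of torchtext; make_example stands in for whatever turns one CSV row into an example.

```python
import csv
import glob


def read_examples(pattern, make_example, skip_header=True):
    """Collect one example per CSV row from every file matching `pattern`.

    `make_example` is a hypothetical callable that converts a row
    (a list of strings) into whatever example object you need.
    """
    examples = []  # start with an empty list and extend it after each file
    for path in sorted(glob.glob(pattern)):  # sorted for a stable file order
        with open(path, encoding="utf8", newline="") as f:
            reader = csv.reader(f)
            if skip_header:
                next(reader, None)  # drop the header row; tolerate empty files
            examples += [make_example(row) for row in reader]
    return examples
```

With torchtext, make_example would presumably be something like lambda row: Example.fromCSV(row, fields), with fields given as a list of (name, field) pairs, and the resulting list could then be passed to a Dataset together with those fields. Treat that part as an assumption to check against your installed version.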

Best regards

Thomas

abhigenie92 · May 5, 2020, 12:43pm

@tom thanks for the reply! I see. I wasn’t looking to modify the package, though; I was thinking of concatenating the TabularDatasets somehow instead.

tom · May 5, 2020, 1:35pm

Not that there is anything wrong with modelling your own dataset after torchtext’s, but perhaps the following is more straightforward:
You cannot use PyTorch’s dataset concatenation (torch.utils.data.ConcatDataset) if you want to stick with torchtext.data.Dataset. What you could do instead is take two (or more) tabular datasets, say ds1 and ds2, and create a new torchtext.data.Dataset, passing ds1.examples + ds2.examples and ds1.fields to the constructor (provided the fields are compatible). That should give you a combined dataset that supports the torchtext extras like splits.
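
A rough sketch of that idea. The helper name concat_datasets is made up for illustration; it only assumes that each dataset exposes .examples (a list) and .fields (a dict keyed by field name), and that dataset_cls accepts (examples, fields), as described above. The field-name comparison is a crude stand-in for "compatible" and does not verify that the fields share a vocabulary or preprocessing.

```python
def concat_datasets(datasets, dataset_cls):
    """Build one dataset from the examples of several compatible datasets.

    Hypothetical helper: `dataset_cls` is any class whose constructor
    takes (examples, fields), e.g. torchtext's Dataset class.
    """
    datasets = list(datasets)
    if not datasets:
        raise ValueError("need at least one dataset")
    fields = datasets[0].fields  # reuse the first dataset's fields
    examples = []
    for ds in datasets:
        # Crude compatibility check: same field names only.
        if set(ds.fields) != set(fields):
            raise ValueError("datasets have different field names")
        examples += ds.examples
    return dataset_cls(examples, fields)


# Assumed usage with torchtext (not executed here):
# combined = concat_datasets([ds1, ds2], torchtext.data.Dataset)
```

Passing the dataset class in as an argument keeps the helper independent of where Dataset lives in a given torchtext release.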

Best regards

Thomas