Is there a way to use TabularDataset (https://torchtext.readthedocs.io/en/latest/data.html#tabulardataset) for loading multiple CSV files? Can I use a wildcard pattern in the path?
Say the training data is split across multiple files.
dataset = TabularDataset(path = "..",
format = "csv", skip_header = True, fields = csv_datafields)
There isn’t, but you can grab the TabularDataset code (which isn’t long and is almost exclusively about loading the file) and wrap its with block in a for loop over your files. If you then start examples as an empty list and use examples += ... after reading each file, you’re good to go.
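A minimal sketch of that loop, using only the standard library so it runs without torchtext. The names read_examples and make_example are hypothetical, not part of torchtext; the default make_example just returns the raw row.

```python
import csv
import glob


def read_examples(pattern, make_example=lambda row: row, skip_header=True):
    """Collect one example per CSV row from every file matching `pattern`.

    `make_example` is a hypothetical hook that turns a parsed row into
    whatever example object you need.
    """
    examples = []  # start with an empty list
    for path in sorted(glob.glob(pattern)):  # wildcard expansion happens here
        with open(path, encoding="utf8", newline="") as f:
            reader = csv.reader(f)
            if skip_header:
                next(reader, None)  # drop the header row of each file
            # extend the running list after reading this file
            examples += [make_example(row) for row in reader]
    return examples
```

With the older torchtext API, make_example could presumably be something like lambda row: Example.fromCSV(row, csv_datafields), and the resulting list would then be passed to a Dataset together with the fields. Treat that as an assumption to check against your installed version.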
@tom thanks for the reply! I see, but I wasn’t looking to modify the package. I was thinking of somehow concatenating TabularDatasets instead.
Not that there is anything wrong with modelling your own dataset after torchtext’s, but perhaps the following looks more straightforward:
You cannot use the PyTorch dataset concatenation if you want to stick with torchtext.data.Dataset. What you could do is take two (or more) tabular datasets, say ds1 and ds2, and create a new Dataset, passing ds1.examples + ds2.examples and ds1.fields to the constructor (provided the fields are compatible). That should give you a combined dataset supporting the torchtext extras like splits.
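As a rough sketch, the combination step could look like this. The helper name combine_datasets is hypothetical, and the dataset class is passed in as an argument so the snippet does not depend on a particular torchtext version. It only assumes that each dataset exposes .examples and .fields and that the class accepts (examples, fields), as the older torchtext.data.Dataset does.

```python
def combine_datasets(datasets, dataset_cls):
    """Build one dataset from the examples of several compatible datasets.

    `dataset_cls` is expected to be constructible as dataset_cls(examples, fields).
    """
    datasets = list(datasets)
    examples = []
    for ds in datasets:
        examples += ds.examples  # same as ds1.examples + ds2.examples + ...
    # Reuse the fields of the first dataset; this assumes all datasets
    # were built with compatible fields.
    return dataset_cls(examples, datasets[0].fields)


# Intended usage with the older torchtext API (not executed here; depending on
# the version, these classes may live under torchtext.legacy.data instead):
#
#   from torchtext.data import Dataset, TabularDataset
#   parts = [TabularDataset(path=p, format="csv", skip_header=True,
#                           fields=csv_datafields) for p in paths]
#   combined = combine_datasets(parts, Dataset)
```

Here paths is a placeholder for your own list of CSV files, for example the result of glob.glob on a wildcard pattern.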