A custom dataset in Torchtext where each example consists of text1, text2, and a label can be written something like this:
import json

import torchtext.data as ttdata  # ttdata is assumed to be torchtext.data


class CustomDataset(ttdata.Dataset):
    """Reads a JSON file. Each line has text1, text2, and the associated label."""

    def __init__(self, path, max_seq_length=100, min_seq_length=3, add_eos=True):
        # One Field per column of every example.
        text1_field = ttdata.Field(lower=True, tokenize='spacy')
        text2_field = ttdata.Field(lower=True, tokenize='spacy')
        label_field = ttdata.Field(sequential=False, use_vocab=False)
        fields = [('text1', text1_field), ('text2', text2_field), ('label', label_field)]
        examples = []
        with open(path, 'r') as f:
            # Each line of the file is a separate JSON object.
            for item in f:
                item = json.loads(item)
                text1, text2, label = item['text1'], item['text2'], item['label']
                examples.append(ttdata.Example.fromlist([text1, text2, label], fields))
        super(CustomDataset, self).__init__(examples, fields)
But in my case, instead of a single string each for text1 and text2, I have two variable-length lists of strings (text1 = ["some text", "some text", ...] and text2 = ["yet some text", "yet some text", ...]) and one label for the pair of lists.
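To make the data format concrete, here is a minimal sketch of how lines of such a file could be parsed. The sample strings, the labels, and the helper name `parse_record` are made up for illustration; only the keys text1, text2, and label come from my actual data.

```python
import json

# Hypothetical sample lines; the strings and labels are made up for illustration.
SAMPLE_LINES = [
    '{"text1": ["some text", "some text"], "text2": ["yet some text"], "label": 1}',
    '{"text1": ["some text"], "text2": ["yet some text", "yet some text", "yet some text"], "label": 0}',
]


def parse_record(line):
    """Parse one JSON line into (text1, text2, label); text1 and text2 are lists of strings."""
    item = json.loads(line)
    return item['text1'], item['text2'], item['label']


records = [parse_record(line) for line in SAMPLE_LINES]

# The number of strings differs from record to record, so one Field
# per string cannot describe every example.
lengths = [(len(text1), len(text2)) for text1, text2, _ in records]
```

Here `lengths` holds the number of strings in text1 and text2 for each record, which is what varies from one example to the next.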
How can I write a dataset class for this case? I think the main problem is that I can no longer define a fixed number of Field objects. Can NestedField be used in some way?