Custom Dataset in Torchtext with variable number of Fields

leewee · November 1, 2019, 4:25am

A custom dataset in Torchtext where each example is a (text1, text2, label) triple can be written something like this:

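import json

from torchtext import data as ttdata  # ttdata is assumed to be an alias for torchtext.data


class CustomDataset(ttdata.Dataset):
    'Reads a JSON file with one object per line. Each line has text1, text2, and the associated label.'

    def __init__(self, path, max_seq_length=100, min_seq_length=3, add_eos=True):
        # max_seq_length, min_seq_length and add_eos are accepted but not used yet.

        # One Field per text column; both are lowercased and tokenized with spaCy.
        text1_field = ttdata.Field(lower=True, tokenize='spacy')
        text2_field = ttdata.Field(lower=True, tokenize='spacy')
        # The label is already numeric, so it needs no tokenization and no vocabulary.
        label_field = ttdata.Field(sequential=False, use_vocab=False)
        fields = [('text1', text1_field), ('text2', text2_field), ('label', label_field)]

        examples = []

        # Parse one JSON object per line and turn it into an Example.
        with open(path, 'r') as f:
            for item in f:
                item = json.loads(item)
                text1, text2, label = item['text1'], item['text2'], item['label']
                examples.append(ttdata.Example.fromlist([text1, text2, label], fields))

        super(CustomDataset, self).__init__(examples, fields)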

But in my case, rather than single strings text1 and text2, I have two lists of strings of variable length (text1 = ["some text", "some text", …] and text2 = ["yet some text", "yet some text", …]) and one label for the pair of lists.
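
To make the input concrete, here is roughly what one line of such a file could look like and how it parses. The key names follow the class above; the values and the label are made up for illustration only.

```python
import json

# One hypothetical line of the input file; the values are invented.
line = '{"text1": ["some text", "some more text"], "text2": ["yet some text"], "label": 1}'

item = json.loads(line)
text1, text2, label = item["text1"], item["text2"], item["label"]

# text1 and text2 are lists of strings, and their lengths can differ
# within a record as well as from one record to the next.
print(len(text1), len(text2), label)
```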

How can I write a dataset class in this case? I think the main problem is that I can no longer define a fixed number of Field objects. Can NestedField be used in some way?
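
For reference, the shape I think a single field would have to produce for each list is a nested list of tokens, padded so that every row has the same width. The sketch below is plain Python and not torchtext API: `tokenize`, `nest_and_pad` and `PAD` are hypothetical names, and the whitespace split only stands in for the spaCy tokenizer.

```python
from typing import List

PAD = "<pad>"  # hypothetical padding token


def tokenize(sentence: str) -> List[str]:
    # Lowercase and split on whitespace; a stand-in for the spaCy tokenizer.
    return sentence.lower().split()


def nest_and_pad(sentences: List[str]) -> List[List[str]]:
    """Tokenize each string, then pad every row to the longest row."""
    rows = [tokenize(s) for s in sentences]
    width = max((len(r) for r in rows), default=0)
    return [r + [PAD] * (width - len(r)) for r in rows]


text1 = ["Some text", "some longer text here"]
padded = nest_and_pad(text1)
# padded[0] == ['some', 'text', '<pad>', '<pad>']
# padded[1] == ['some', 'longer', 'text', 'here']
```

The number of rows still varies per example, so a batch would additionally need padding along that outer dimension.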