Hey, I'm fairly new to deep learning and am currently working on an NLP project for text summarization.
For some reason, when I load the dataset using Fields with fix_length specified, the length of the data points appears unchanged from the original.
from torchtext.data import Field, TabularDataset  # on newer torchtext versions these live under torchtext.legacy.data

# tokenize_en is my English tokenizer function
SRC = Field(tokenize=tokenize_en,
            init_token='<sos>',
            eos_token='<eos>',
            fix_length=400,
            lower=True)

TRG = Field(tokenize=tokenize_en,
            init_token='<sos>',
            eos_token='<eos>',
            fix_length=100,
            lower=True)

fields = {'doc': ('doc', SRC), 'summaries': ('summaries', TRG)}

train_data, valid_data, test_data = TabularDataset.splits(
    path='./',
    train='train.json',
    validation='val.json',
    test='test.json',
    format='json',
    fields=fields)
For example, if I run the line below, the length still exceeds 400:
print(len(vars(train_data.examples[40])['doc']))
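For reference, here is a minimal sketch of how the same check could be done on a padded batch instead, assuming a BucketIterator, a batch size of 32, and that the vocabularies have been built first (I'm not sure whether fix_length is only meant to take effect at that stage):

from torchtext.data import BucketIterator  # on newer torchtext versions this lives under torchtext.legacy.data

# Padding/numericalizing requires vocabularies
SRC.build_vocab(train_data)
TRG.build_vocab(train_data)

train_iter = BucketIterator(train_data,
                            batch_size=32,
                            sort_key=lambda x: len(x.doc))

batch = next(iter(train_iter))
print(batch.doc.shape)  # I would expect torch.Size([400, 32]) if fix_length is applied during padding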
Can someone point out to me what I am doing wrong or perhaps suggest another solution? Thanks for your time!