torchtext.data.Field doesn't pad

Lymo · July 28, 2020, 1:41pm

Hi, I’m using BERT to do relation classification.

Here is my setup for torchtext.data.Field:

# Imports (assumed: torchtext's Field API and Hugging Face transformers)
import torch
from torchtext.data import Field, TabularDataset
from transformers import BertTokenizer

# Initialize tokenizer
pretrain_model = "bert-base-uncased"
additional_special_tokens = ['[E1]', '[/E1]', '[E2]', '[/E2]']
tokenizer = BertTokenizer.from_pretrained(pretrain_model, do_lower_case=True,
                                          additional_special_tokens=additional_special_tokens)

# Model parameters
MAX_SEQ_LEN = 512
PAD_INDEX = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
UNK_INDEX = tokenizer.convert_tokens_to_ids(tokenizer.unk_token)
batch_size = 32

# Fields
label_field = Field(sequential=False, use_vocab=False, batch_first=True, dtype=torch.float)
text_field = Field(use_vocab=False, tokenize=tokenizer.encode, lower=False, include_lengths=False, batch_first=True,
                   fix_length=MAX_SEQ_LEN, pad_token=PAD_INDEX, unk_token=UNK_INDEX)
fields = [('label', label_field), ('text', text_field)]

I then tried to verify that it works correctly, using preprocess as described in the torchtext docs:

temp_preprocess = text_field.preprocess(train_data.iloc[0:10, -1])

Unfortunately, it didn’t work. The output is identical to the input, with no padding and no tokenization:

[CLS] sen. charles e. schumer called on federal safety officials yesterday to reopen their investigation into the fatal crash of a passenger jet in [E2] belle_harbor [/E2] , [E1] queens [/E1] , because equipment failure , not pilot error , might have been the cause . [SEP]
[CLS] but instead there was a funeral , at st. francis de sales roman catholic church , in [E2] belle_harbor [/E2] , [E1] queens [/E1] , the parish of his birth . [SEP]
[CLS] rosemary antonelle , the daughter of teresa l. antonelle and patrick antonelle of [E2] belle_harbor [/E2] , [E1] queens [/E1] , was married yesterday afternoon to lt. thomas joseph quast , a son of peggy b. quast and vice adm. philip m. quast of carmel , calif. . [SEP]
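
A possible explanation, offered as a sketch rather than a confirmed diagnosis: in this torchtext API, Field.preprocess appears to tokenize only when its argument is a single str, and it does not pad at all (padding seems to happen later, when a batch is built). A whole pandas Series passed to it would then come back unchanged. The stand-in below mimics that check in plain Python; preprocess_like_field and toy_tokenize are hypothetical names for illustration, not torchtext functions.

```python
from typing import Callable, List


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for tokenizer.encode: split on whitespace."""
    return text.split()


def preprocess_like_field(x, tokenize: Callable[[str], list] = toy_tokenize):
    """Simplified imitation of the assumed Field.preprocess behaviour.

    Tokenizes only when x is a single string; anything else (a list,
    a pandas Series, ...) is returned unchanged. No padding happens here.
    """
    if isinstance(x, str):
        return tokenize(x.rstrip("\n"))
    return x


rows = ["[CLS] first sentence [SEP]", "[CLS] second one [SEP]"]

# Passing the whole collection at once: nothing is tokenized.
whole = preprocess_like_field(rows)

# Passing one row at a time: each string is tokenized.
per_row = [preprocess_like_field(r) for r in rows]
```

Under that assumption, calling text_field.preprocess on one row at a time (for example on train_data.iloc[0, -1]) would be the way to inspect the tokenized output.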

I also tried using TabularDataset to read data from files:

train_td = TabularDataset(path='./origin_data/train_filtered_bf.txt', format='tsv', fields=fields)
test_td = TabularDataset(path='./origin_data/test_filtered_bf.txt', format='tsv', fields=fields)
print(train_td[1].text)

The result did not contain any padding tokens either:

[101, 101, 2021, 2612, 2045, 2001, 1037, 6715, 1010, 2012, 2358, 1012, 4557, 2139, 4341, 3142, 3234, 2277, 1010, 1999, 30525, 9852, 1035, 6496, 30522, 1010, 30524, 8603, 30523, 1010, 1996, 3583, 1997, 2010, 4182, 1012, 102, 102]
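
Two things may be going on in this output, both stated here as assumptions. First, indexing a dataset returns an example that has only been preprocessed; in this torchtext API, padding to fix_length seems to be applied only when examples are batched (for instance by an iterator). Second, the doubled 101 and 102 suggest that tokenizer.encode adds its own [CLS] and [SEP] to text that already contains them. The plain-Python sketch below shows what padding this list to a fixed length would look like; pad_to_length is a hypothetical helper, and the pad id 0 is assumed to be what PAD_INDEX resolves to for bert-base-uncased.

```python
from typing import List


def pad_to_length(ids: List[int], length: int, pad_id: int = 0) -> List[int]:
    """Truncate or right-pad a list of token ids to exactly `length` items."""
    return ids[:length] + [pad_id] * max(0, length - len(ids))


# Token ids printed for train_td[1].text in the post.
example_ids = [101, 101, 2021, 2612, 2045, 2001, 1037, 6715, 1010, 2012,
               2358, 1012, 4557, 2139, 4341, 3142, 3234, 2277, 1010, 1999,
               30525, 9852, 1035, 6496, 30522, 1010, 30524, 8603, 30523,
               1010, 1996, 3583, 1997, 2010, 4182, 1012, 102, 102]

MAX_SEQ_LEN = 512  # same value as in the post
padded = pad_to_length(example_ids, MAX_SEQ_LEN)

# The original ids are kept at the front; the rest is filled with pad_id.
print(len(padded), padded[len(example_ids):len(example_ids) + 3])
```

If batching is indeed where padding happens, iterating over the dataset with a batch iterator and inspecting batch.text should show tensors of width MAX_SEQ_LEN.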

Did I miss anything? Any help would be appreciated. Thanks in advance.