I am trying to train a question answering model on a dataset similar to the SQuAD setting. I preprocess the sequence in each example so that long examples are split into multiple samples that fit within BERT's max_length, using a sliding-window approach, and I pad each sequence to max_length=384 if needed. I load the data with a `Dataset` class and use the default collate_fn. My `__getitem__` method looks like this:
```python
# Tokenize the sequence, find the start and end positions after
# tokenization, and match concepts to tokens.
tokenized_example, concepts_ids, lengths = self.prepare_train_features(
    self.data[index], self.wikidata_concepts[index])

# Convert to tensors
all_input_ids = torch.tensor(tokenized_example['input_ids'], dtype=torch.long)
print("input ids ", all_input_ids)
all_input_mask = torch.tensor(tokenized_example['attention_mask'], dtype=torch.long)
print("input mask ", all_input_mask.shape)
all_segment_ids = torch.tensor(tokenized_example['token_type_ids'], dtype=torch.long)
print("segment ids ", all_segment_ids.shape)
# all_cls_index = torch.tensor([f.cls_index for f in tokenized_example], dtype=torch.long)
# all_p_mask = torch.tensor([f.p_mask for f in tokenized_example], dtype=torch.float)
all_start_positions = torch.tensor(tokenized_example['start_positions'], dtype=torch.long)
all_end_positions = torch.tensor(tokenized_example['end_positions'], dtype=torch.long)
print("all_start_positions ", all_start_positions.shape)
all_concepts_ids = torch.tensor(concepts_ids, dtype=torch.long)
all_lengths = torch.tensor(lengths, dtype=torch.long)
print("concepts ", all_concepts_ids.shape)
```
So the shape of my tensors depends on the sequence length of the example. If it is less than 384, the shapes are:
input ids torch.Size([1, 384])
input mask torch.Size([1, 384])
segment ids torch.Size([1, 384])
concepts torch.Size([1, 384, 20])
But if the example is split into multiple sequences, the first dimension becomes 2, 3, or more, depending on the length.
When I load the data I get an error message, which makes sense given the differing sizes of the first dimension across examples. I need to treat each sample from a long sequence as its own example in the batch, but I don't know how to do that. I would really appreciate any hint.
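For context, what I have in mind is something like a custom collate_fn that concatenates the per-example chunks along the first dimension, so every chunk becomes its own row in the batch. This is only a minimal sketch with a hypothetical function name, not working code from my project:

```python
import torch
from torch.utils.data import DataLoader

def flatten_collate(batch):
    # Each dataset item is a tuple of tensors shaped (num_chunks, ...),
    # where num_chunks varies per example. Concatenating along dim 0
    # turns every chunk into a separate row, so a batch of examples
    # with 1, 2, and 3 chunks yields 6 rows.
    return tuple(torch.cat(tensors, dim=0) for tensors in zip(*batch))

# Hypothetical usage (note: the effective batch size then varies,
# since it equals the total number of chunks, not the number of examples):
# loader = DataLoader(dataset, batch_size=8, collate_fn=flatten_collate)
```

Would something along these lines be the right approach, or is there a standard way to do this?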