I am trying to train a question answering model on a dataset similar to the SQuAD setting. I preprocess the sequence in each example so that long examples are split into multiple samples that fit within BERT's max_length, using a sliding-window approach, and I pad each sequence to max_length=384 if needed. I load the data with a `Dataset` class and use the default collate_fn. My `__getitem__` method looks like this:
```python
# Tokenize the sequence, find the start and end positions after
# tokenization, and match concepts to tokens.
tokenized_example, concepts_ids, lengths = self.prepare_train_features(
    self.data[index], self.wikidata_concepts[index])

# Convert to tensors
all_input_ids = torch.tensor(tokenized_example['input_ids'], dtype=torch.long)
print("input ids ", all_input_ids)
all_input_mask = torch.tensor(tokenized_example['attention_mask'], dtype=torch.long)
print("input mask ", all_input_mask.shape)
all_segment_ids = torch.tensor(tokenized_example['token_type_ids'], dtype=torch.long)
print("segment ids ", all_segment_ids.shape)
# all_cls_index = torch.tensor([f.cls_index for f in tokenized_example], dtype=torch.long)
# all_p_mask = torch.tensor([f.p_mask for f in tokenized_example], dtype=torch.float)
all_start_positions = torch.tensor(tokenized_example['start_positions'], dtype=torch.long)
all_end_positions = torch.tensor(tokenized_example['end_positions'], dtype=torch.long)
print("all_start_positions ", all_start_positions.shape)
all_concepts_ids = torch.tensor(concepts_ids, dtype=torch.long)
all_lengths = torch.tensor(lengths, dtype=torch.long)
print("concepts ", all_concepts_ids.shape)
```
So the shape of my tensors depends on the sequence length of the example. If it is less than 384, the shapes are:
input ids torch.Size([1, 384])
input mask torch.Size([1, 384])
segment ids torch.Size([1, 384])
concepts torch.Size([1, 384, 20])
But if the example is split into multiple sequences, the first dimension becomes 2, 3, or more, depending on the length.
When I load the data I get an error message, which makes sense given the differing sizes of the first dimension across examples. I need to treat each sample from a long sequence as its own example in the batch, but I don't know how to do that. I would really appreciate any hint.
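For context, what I have in mind is something like a custom collate_fn that concatenates the per-example chunks along the first dimension, so every chunk becomes its own row in the batch. This is only a minimal sketch with a hypothetical function name, not working code from my project:

```python
import torch
from torch.utils.data import DataLoader

def flatten_collate(batch):
    # Each dataset item is a tuple of tensors shaped (num_chunks, ...),
    # where num_chunks varies per example. Concatenating along dim 0
    # turns every chunk into a separate row, so a batch of examples
    # with 1, 2, and 3 chunks yields 6 rows.
    return tuple(torch.cat(tensors, dim=0) for tensors in zip(*batch))

# Hypothetical usage (note: the effective batch size then varies,
# since it equals the total number of chunks, not the number of examples):
# loader = DataLoader(dataset, batch_size=8, collate_fn=flatten_collate)
```

Would something along these lines be the right approach, or is there a standard way to do this?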