I am trying to train a question answering model on a dataset similar to the SQuAD setting. I preprocess the sequence in each example so that long examples are split into multiple samples that fit within BERT's max_length, using a sliding-window approach, and I pad each sequence to max_length=384 if needed. I load the data with a `Dataset` class and use the default collate_fn. My `__getitem__` method looks like this:
```python
# Tokenize the sequence, find the start and end positions after
# tokenization, and match concepts to tokens.
tokenized_example, concepts_ids, lengths = self.prepare_train_features(
    self.data[index], self.wikidata_concepts[index])

# Convert to tensors
all_input_ids = torch.tensor(tokenized_example['input_ids'], dtype=torch.long)
print("input ids ", all_input_ids)
all_input_mask = torch.tensor(tokenized_example['attention_mask'], dtype=torch.long)
print("input mask ", all_input_mask.shape)
all_segment_ids = torch.tensor(tokenized_example['token_type_ids'], dtype=torch.long)
print("segment ids ", all_segment_ids.shape)
# all_cls_index = torch.tensor([f.cls_index for f in tokenized_example], dtype=torch.long)
# all_p_mask = torch.tensor([f.p_mask for f in tokenized_example], dtype=torch.float)
all_start_positions = torch.tensor(tokenized_example['start_positions'], dtype=torch.long)
all_end_positions = torch.tensor(tokenized_example['end_positions'], dtype=torch.long)
print("all_start_positions ", all_start_positions.shape)
all_concepts_ids = torch.tensor(concepts_ids, dtype=torch.long)
all_lengths = torch.tensor(lengths, dtype=torch.long)
print("concepts ", all_concepts_ids.shape)
```
So the shape of my tensors depends on the sequence length of the example. If it is less than 384, the shapes are:
input ids torch.Size([1, 384])
input mask torch.Size([1, 384])
segment ids torch.Size([1, 384])
concepts torch.Size([1, 384, 20])
But if the example is split into multiple sequences, the first dimension becomes 2, 3, or more, depending on the length.
When I load the data I get an error message, which makes sense given the differing sizes of the first dimension across examples. I need to treat each sample from a long sequence as its own example in the batch, but I don't know how to do that. I would really appreciate any hint.
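For context, what I have in mind is something like a custom collate_fn that concatenates the per-example chunks along the first dimension, so every chunk becomes its own row in the batch. This is only a minimal sketch with a hypothetical function name, not working code from my project:

```python
import torch
from torch.utils.data import DataLoader

def flatten_collate(batch):
    # Each dataset item is a tuple of tensors shaped (num_chunks, ...),
    # where num_chunks varies per example. Concatenating along dim 0
    # turns every chunk into a separate row, so a batch of examples
    # with 1, 2, and 3 chunks yields 6 rows.
    return tuple(torch.cat(tensors, dim=0) for tensors in zip(*batch))

# Hypothetical usage (note: the effective batch size then varies,
# since it equals the total number of chunks, not the number of examples):
# loader = DataLoader(dataset, batch_size=8, collate_fn=flatten_collate)
```

Would something along these lines be the right approach, or is there a standard way to do this?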