torch.nn.utils.rnn.pad_sequence error!

I am working on a machine translation task using LSTMs (English to Hinglish), and I am running into an error with pad_sequence when using it inside a collate_fn to batch variable-length sequences.

Below are snippets of the dataset, the data loader, and the collater (where pad_sequence is used), along with debug logs showing the shapes of the tensors produced.

1. Dataset Class

class TranslationDataset(Dataset):

    def __init__(self,dataframe,english_vocab,hinglish_vocab,transform=None):
        self.dataframe = dataframe
        self.english = self.dataframe['English'].values.tolist()
        self.hinglish = self.dataframe['Hinglish'].values.tolist()
        self.input_lang = english_vocab
        self.output_lang = hinglish_vocab
        #building vocabulary
        for english_sent in self.english:
            self.input_lang.process_sentence(english_sent)
        for hinglish_sent in self.hinglish:
            self.output_lang.process_sentence(hinglish_sent)
        # creating tensors
        self.hinglish_tensors =[tensor_from_sentence(hinglish_vocab,sentence) for sentence in self.hinglish]
        self.english_tensors = [tensor_from_sentence(english_vocab,sentence) for sentence in self.english]
        

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self,index):
        hinglish_sample = self.hinglish_tensors[index]
        english_sample = self.english_tensors[index]
        sample = {'input':english_sample,'output':hinglish_sample}

        return sample
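For completeness, tensor_from_sentence isn't shown above; a minimal sketch of what it likely does, assuming the vocab exposes a stoi mapping and an <EOS> token at index 2 (both sequences in the debug log below end in 2 — these names and the index are assumptions, not my actual code):

```python
import torch

EOS_TOKEN = 2  # assumed <EOS> index, matching the trailing 2 in the debug log

def tensor_from_sentence(vocab, sentence):
    # hypothetical reconstruction: map each word to its index and append <EOS>
    indices = [vocab.stoi[word] for word in sentence.split()] + [EOS_TOKEN]
    # note the unsqueeze: this produces shape [1, seq_len], not [seq_len]
    return torch.tensor(indices).unsqueeze(0)
```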

2. Debug log of __getitem__ with shapes

train_dataset = TranslationDataset(train_df,english_vocab,hinglish_vocab)
item = train_dataset.__getitem__(5)
print(item['input'], item['input'].shape, item['output'],item['output'].shape)

STDOUT ::
tensor([[84, 27, 85, 86, 25, 87, 11, 17, 15, 88, 89, 90, 91, 92, 15, 93, 17, 29,
94, 23, 2]]) torch.Size([1, 21]) tensor([[ 66, 24, 81, 34, 110, 111, 112, 44, 113, 114, 115, 38, 116, 82,
117, 7, 44, 118, 119, 120, 85, 121, 2]]) torch.Size([1, 23])

3. Collater Class

#collate_function
class Collater(object):
    def __init__(self, pad_index):
        self.pad_index = pad_index

    def __call__(self, batch):

        input = [item['input'] for item in batch]
        output = [item['output'] for item in batch]
        input = pad_sequence(input, batch_first=False, padding_value=self.pad_index)
        output = pad_sequence(output, batch_first=False, padding_value=self.pad_index)

        item = {'input':input, 'output':output}
        return item
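For reference, pad_sequence expects each element of the batch to be a tensor of shape [seq_len, *] where all trailing dimensions match across elements; a minimal sketch of the intended behaviour with plain 1-D tensors:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

a = torch.tensor([1, 2, 3])  # shape [3]
b = torch.tensor([4, 5])     # shape [2]

# pads along dim 0 up to the longest sequence
padded = pad_sequence([a, b], batch_first=False, padding_value=0)
print(padded.shape)  # torch.Size([3, 2]) -- [max_len, batch]
```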

4. Usage of Collater class in collate_fn arg while creating dataloader

pad_idx = english_vocab.stoi["<PAD>"]
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True, num_workers=4, pin_memory=True, drop_last=True, collate_fn=Collater(pad_idx))

5. Debug log of the DataLoader failing to collate a batch

for batch in train_loader:
    print(batch['input'].shape, batch['output'].shape)
    break

STDOUT
input = pad_sequence(input, batch_first=False, padding_value=self.pad_index)
File "/home/suraj/anaconda3/envs/torch_dl/lib/python3.8/site-packages/torch/nn/utils/rnn.py", line 398, in pad_sequence
return torch._C._nn.pad_sequence(sequences, batch_first, padding_value)
RuntimeError: The size of tensor a (27) must match the size of tensor b (10) at non-singleton dimension 1
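The error is reproducible whenever the per-sample tensors carry an extra leading dimension (shape [1, seq_len] instead of [seq_len]): pad_sequence pads along dim 0 and requires all trailing dimensions to match, so the mismatched dim 1 raises exactly this RuntimeError:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# per-sample tensors with an extra leading dim, like the shapes in the debug log
a = torch.randint(0, 100, (1, 27))  # shape [1, 27]
b = torch.randint(0, 100, (1, 10))  # shape [1, 10]

# dim 0 (both 1) is treated as the sequence dim; dim 1 (27 vs 10) must match and doesn't
try:
    pad_sequence([a, b], batch_first=False, padding_value=0)
except RuntimeError as e:
    print(e)
```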

I don't want to fall back to F.pad; I'd like to stick with pad_sequence, since it worked fine for an image captioning task!

Thanks in advance for the assist :smiley: !

Solved!
It was a silly mistake in __getitem__: tensor_from_sentence returns tensors of shape [1, seq_len], but pad_sequence expects each sample to be [seq_len] (it pads along dim 0 and requires all trailing dimensions to match). Indexing with [0] drops the extra leading dimension:

hinglish_sample = self.hinglish_tensors[index][0]
english_sample = self.english_tensors[index][0]
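Equivalently, the extra dimension can be stripped once when the tensors are built; a short sketch, assuming the per-sample tensors are shaped [1, seq_len] as in the debug log:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# stand-ins for the [1, seq_len] tensors produced by tensor_from_sentence
samples = [torch.randint(0, 100, (1, n)) for n in (21, 23, 10)]

# t[0] (equivalent to t.squeeze(0)) turns [1, seq_len] into the [seq_len] shape pad_sequence expects
flat = [t[0] for t in samples]

padded = pad_sequence(flat, batch_first=False, padding_value=0)
print(padded.shape)  # torch.Size([23, 3]) -- [max_len, batch]
```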