Batching in DataLoader for Text Supervised Data

I am working on the Quora duplicate questions dataset, which has 3 columns: ‘question1’, ‘question2’, and ‘is_duplicate’ (the duplicate flag).

I want to use a DataLoader to speed up training. Any ideas on how to implement batching in the DataLoader when you have two texts of different lengths?

PS: I am using fastText embeddings for each word.

I created the Dataset as follows:

import pandas as pd
import torch
import torch.utils.data as Data
from torch.utils.data import Dataset

class QuoraDataset(Dataset):
    def __init__(self, csv, transform=None, separator=','):
        df = pd.read_csv(csv, sep=separator)

        # Whitespace-tokenize both questions.
        df['question1'] = df['question1'].astype(str)
        self.q_1 = [i.split(' ') for i in df['question1'].values.tolist()]

        df['question2'] = df['question2'].astype(str)
        self.q_2 = [i.split(' ') for i in df['question2'].values.tolist()]

        self.label = df['is_duplicate'].values.tolist()

        self.length = len(df)
        self.transform = transform

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        # get_ft maps a word to its fastText vector (see the sketch below).
        inputs1 = torch.tensor([get_ft(word) for word in self.q_1[index]])
        inputs2 = torch.tensor([get_ft(word) for word in self.q_2[index]])

        sample = {'x1': inputs1, 'x2': inputs2, 'y': self.label[index]}
        return sample
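
For reference, get_ft looks up the fastText vector for a word. A minimal sketch of what I use, assuming the fasttext package and a pretrained model file (the paths ‘cc.en.300.bin’ and ‘train.csv’ are placeholders):

import fasttext

# Pretrained fastText model; get_word_vector returns a 300-dim vector
# for any word, including OOV words, via subword information.
ft_model = fasttext.load_model('cc.en.300.bin')

def get_ft(word):
    return ft_model.get_word_vector(word)

train_ds = QuoraDataset('train.csv')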

You could write a custom collate_fn, which will create the batches from the variable-length data samples.
Have a look at this example.

@ptrblck Thanks for the response. I read that post and wrote a custom collate_fn, which is as follows:

# build dataloaders
from torch.nn.utils.rnn import pad_sequence

def collate_fn_padd(batch):
    batch_x1 = [t['x1'] for t in batch]
    batch_x2 = [t['x2'] for t in batch]
    batch_y = [t['y'] for t in batch]

    # Original (unpadded) lengths, needed later for packing.
    lengths_x1 = torch.tensor([t.shape[0] for t in batch_x1])
    lengths_x2 = torch.tensor([t.shape[0] for t in batch_x2])

    # Pad each side of the pair to the longest sequence in its own batch.
    b_x1 = pad_sequence(batch_x1, batch_first=True, padding_value=0)
    b_x2 = pad_sequence(batch_x2, batch_first=True, padding_value=0)
    batch_y = torch.tensor(batch_y)

    return b_x1, b_x2, lengths_x1, lengths_x2, batch_y

batch_size = 2
train_dl = Data.DataLoader(train_ds, batch_size=batch_size, shuffle=False, collate_fn=collate_fn_padd)
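
To sanity-check the collate_fn, I fetch one batch and look at the padded shapes (the 300 below assumes 300-dim fastText vectors):

b_x1, b_x2, lengths_x1, lengths_x2, b_y = next(iter(train_dl))
print(b_x1.shape)   # (2, longest question1 in this batch, 300)
print(b_x2.shape)   # (2, longest question2 in this batch, 300)
print(lengths_x1, lengths_x2, b_y)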

The problem starts now. I have a model that takes the two sentences and runs an LSTM on them, as follows:

import torch.nn as nn

class LR(nn.Module):
    def __init__(self, embedding_dim, hidden_dim):
        super(LR, self).__init__()

        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=LSTM_LAYERS,
                            dropout=DROPOUT, bidirectional=True)
        self.Bilinear = nn.Bilinear(2 * LSTM_LAYERS * HIDDEN_DIM,
                                    2 * LSTM_LAYERS * HIDDEN_DIM, 1, bias=False)

    def forward(self, sentence1, sentence2):
        # view(len, 1, -1) reshapes to (seq_len, batch=1, embedding_dim),
        # so this forward only handles one sentence pair at a time.
        _, (hidden_state1, _) = self.lstm(sentence1.view(len(sentence1), 1, -1))
        _, (hidden_state2, _) = self.lstm(sentence2.view(len(sentence2), 1, -1))

        # Flatten the final hidden states of all layers and directions.
        h1 = hidden_state1.view(-1)
        h2 = hidden_state2.view(-1)

        y_predict = torch.sigmoid(self.Bilinear(h1, h2))

        return y_predict
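
For completeness, this is how I currently call it one pair at a time (the hyperparameter values are placeholders):

EMBEDDING_DIM, HIDDEN_DIM, LSTM_LAYERS, DROPOUT = 300, 128, 2, 0.2

model = LR(EMBEDDING_DIM, HIDDEN_DIM)
sample = train_ds[0]
# Works only for a single pair, since forward() reshapes each sentence to batch size 1.
y_hat = model(sample['x1'], sample['x2'])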

As the model uses an LSTM, I was trying to pack the two batches ‘x1’ and ‘x2’ separately. The function pack_padded_sequence requires the elements of a batch (the 2 sentences for batch_size = 2 in my case) to be sorted by length in decreasing order. I have a two-fold confusion here:

  1. The elements of each batch ‘x1’ and ‘x2’ can have totally different lengths, so sorting would put ‘x1’ and ‘x2’ in different orders, while batch_y is common to both. Any method to resolve this dilemma? (See the sketch after this list.)
  2. When passing the data in batches like this, is order relevant anymore?
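
One way I am considering to resolve (1): pack_padded_sequence accepts enforce_sorted=False (available since PyTorch 1.1), which sorts each batch internally and returns the LSTM hidden states in the original batch order, so ‘x1’, ‘x2’ and ‘y’ stay aligned without any manual re-sorting. A rough sketch of a batched forward along those lines (batch_first=True to match the collate_fn above; not tested end-to-end):

from torch.nn.utils.rnn import pack_padded_sequence

class BatchedLR(nn.Module):
    def __init__(self, embedding_dim, hidden_dim):
        super(BatchedLR, self).__init__()
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=LSTM_LAYERS,
                            dropout=DROPOUT, bidirectional=True, batch_first=True)
        feat = 2 * LSTM_LAYERS * HIDDEN_DIM
        self.Bilinear = nn.Bilinear(feat, feat, 1, bias=False)

    def encode(self, padded, lengths):
        # enforce_sorted=False: no need to sort the batch by length myself;
        # the hidden states come back in the original batch order.
        packed = pack_padded_sequence(padded, lengths.cpu(), batch_first=True,
                                      enforce_sorted=False)
        _, (h_n, _) = self.lstm(packed)
        # h_n: (num_layers * 2, batch, hidden_dim) -> (batch, num_layers * 2 * hidden_dim)
        return h_n.permute(1, 0, 2).reshape(h_n.size(1), -1)

    def forward(self, x1, x2, lengths_x1, lengths_x2):
        h1 = self.encode(x1, lengths_x1)
        h2 = self.encode(x2, lengths_x2)
        return torch.sigmoid(self.Bilinear(h1, h2)).squeeze(-1)

If that is right, (2) should be moot as far as I can tell: within-batch order would not matter, since packing tracks the per-sample lengths. Is this understanding correct?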