What is the best way to read in data for a transformer model?

I have sequence data where both source and target are sequences of varying length. Below is an example with 3 cases, where skill_seq is my X and label_seq is my Y. The data will be passed to a transformer model. Is the Dataset -> DataLoader approach the best way to read in and batch the data? Code below.

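import pandas as pd

skill_seq = [[1, 3, 2, 4], [4, 4, 4, 5, 5, 5], [3, 2, 1]]
label_seq = [[1, 0, 1, 1], [1, 1, 1, 0, 0, 1], [0, 1, 1]]

# build the frame from a dict so the ragged lists are stored as-is, one list per row
seq_df = pd.DataFrame({'skills': skill_seq, 'labels': label_seq})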

import torch
from torch.utils.data import Dataset

class Train_Dataset(Dataset):
    def __init__(self, df, source_column, target_column, transform=None):
        self.df = df
        self.transform = transform
        # get source and target sequences
        self.skill_seq = self.df[source_column]
        self.labels = self.df[target_column]

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):
        skill_seq = self.skill_seq[index]
        labels = self.labels[index]
        if self.transform is not None:
            # apply the optional transform to the source sequence
            skill_seq = self.transform(skill_seq)
        return torch.tensor(skill_seq), torch.tensor(labels)

user_seqs = Train_Dataset(seq_df, 'skills', 'labels')
dkt_loader = torch.utils.data.DataLoader(user_seqs, batch_size=1, shuffle=True)

The main challenge I have here is the need to pad. I thought I could pad as part of the forward pass in the model, but the DataLoader throws an error when the tensors in a batch have different sizes. So I need to figure out how to pad at the DataLoader step, within each batch rather than across the whole dataset. Maybe using collate_fn?

Have you had a look at the tutorial here?


That might give you some ideas on how to feed data in this case.

At any rate, you could just define your __getitem__ to pad the data and labels. Pass max_seq_len and pad_token as arguments to __init__, and then just do:

data = torch.tensor(skill_seq)
data_pad_len = max_seq_len - data.size(0)
# match the dtype of data so torch.cat does not mix integer and float tensors
data = torch.cat([data, torch.full((data_pad_len,), pad_token, dtype=data.dtype)])

Do the same for your labels.
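
To make the per-item padding idea concrete, here is a minimal sketch of a dataset that pads inside __getitem__. The class name PaddedDataset, the _pad helper, and the label_pad argument are hypothetical names introduced for illustration. Padding labels with -1 is an assumption, chosen so that padding cannot be confused with the real 0/1 labels.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class PaddedDataset(Dataset):
    """Hypothetical dataset that pads every sample to max_seq_len in __getitem__."""

    def __init__(self, skill_seqs, label_seqs, max_seq_len, pad_token=0, label_pad=-1):
        self.skill_seqs = skill_seqs
        self.label_seqs = label_seqs
        self.max_seq_len = max_seq_len
        self.pad_token = pad_token
        self.label_pad = label_pad  # assumption: -1 marks padded label positions

    def __len__(self):
        return len(self.skill_seqs)

    def _pad(self, seq, value):
        # right-pad a single sequence with `value` up to max_seq_len
        data = torch.tensor(seq, dtype=torch.long)
        pad_len = self.max_seq_len - data.size(0)
        if pad_len < 0:
            raise ValueError("sequence is longer than max_seq_len")
        padding = torch.full((pad_len,), value, dtype=torch.long)
        return torch.cat([data, padding])

    def __getitem__(self, index):
        return (self._pad(self.skill_seqs[index], self.pad_token),
                self._pad(self.label_seqs[index], self.label_pad))


skill_seq = [[1, 3, 2, 4], [4, 4, 4, 5, 5, 5], [3, 2, 1]]
label_seq = [[1, 0, 1, 1], [1, 1, 1, 0, 0, 1], [0, 1, 1]]

padded = PaddedDataset(skill_seq, label_seq, max_seq_len=6)
loader = DataLoader(padded, batch_size=3)
x, y = next(iter(loader))
print(x.shape, y.shape)
```

Since every sample now has the same length, the default collation can stack them without a custom collate_fn. The trade-off is that every batch is padded to max_seq_len, even when all of its sequences are shorter.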

Thanks, I did look at that example. I'm also following the Deep Learning with PyTorch book, so the example confused me: why was the Dataset -> DataLoader approach abandoned in favor of batchify()?

Something like this seems to work. It's a bit more verbose, but I think it pads per batch.

def collate_fn(data):
  # data is a list of (skill_seq, labels) pairs, one per sample
  skill_seq, labels = zip(*data)
  seq_lengths = [len(seq) for seq in skill_seq]
  max_len = max(seq_lengths)

  # pad skills with 0 up to the longest sequence in this batch
  seq_tensor = torch.zeros((len(skill_seq), max_len), dtype=torch.long)
  for idx, (seq, seqlen) in enumerate(zip(skill_seq, seq_lengths)):
     seq_tensor[idx, :seqlen] = torch.as_tensor(seq, dtype=torch.long)

  # pad labels with -1 so padding is distinct from the real 0/1 labels
  label_tensor = torch.full((len(labels), max_len), -1, dtype=torch.long)
  for idx, (seq, seqlen) in enumerate(zip(labels, seq_lengths)):
     label_tensor[idx, :seqlen] = torch.as_tensor(seq, dtype=torch.long)
  return seq_tensor, label_tensor

dkt_loader = torch.utils.data.DataLoader(user_seqs, batch_size=3, collate_fn=collate_fn)

for epoch in range(1, n_epochs+1):
  for x,y in dkt_loader:
    query, key, values = model(x,y)
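
For comparison, the same per-batch padding can probably be written more compactly with torch.nn.utils.rnn.pad_sequence. This is a sketch, not a drop-in replacement: the name pad_collate and the extra lengths return value are assumptions for illustration, and a plain list of (skills, labels) pairs stands in for the dataset.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader


def pad_collate(batch):
    # batch is a list of (skills, labels) pairs, one per sample
    skills, labels = zip(*batch)
    skills = [torch.as_tensor(s, dtype=torch.long) for s in skills]
    labels = [torch.as_tensor(l, dtype=torch.long) for l in labels]
    # true lengths, kept so a padding mask can be built later if needed
    lengths = torch.tensor([len(s) for s in skills])
    # pad to the longest sequence in this batch only
    x = pad_sequence(skills, batch_first=True, padding_value=0)
    y = pad_sequence(labels, batch_first=True, padding_value=-1)
    return x, y, lengths


skill_seq = [[1, 3, 2, 4], [4, 4, 4, 5, 5, 5], [3, 2, 1]]
label_seq = [[1, 0, 1, 1], [1, 1, 1, 0, 0, 1], [0, 1, 1]]

# a plain list of pairs works as a map-style dataset for this sketch
samples = list(zip(skill_seq, label_seq))
loader = DataLoader(samples, batch_size=2, collate_fn=pad_collate)
for x, y, lengths in loader:
    print(x.shape, lengths.tolist())
```

With batch_size=2 and no shuffling, the first batch is padded to length 6 and the second only to length 3, which is the per-batch behavior asked about above.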

Right, but if you decide you need the padding done per item, you'll likely need to rewrite the __getitem__ method of your custom Dataset, as mentioned in my previous comment. The DataLoader calls it in each worker, according to the num_workers setting.

That makes sense, thanks for your help @J_Johnson