I have sequence data where both source and target are sequences of varied length. Below is an example of 3 cases where skill_seq is my X and label is my Y. The data will be passed to a transformer model. Is the Dataset -> DataLoader approach the best way to read in and batch the data? Code below.
The main challenge I have here is the need to pad. I thought I could pad as part of the forward pass in the model, but it seems like the DataLoader throws an error when the tensors are of different sizes between samples. So I need to figure out how to pad at the DataLoader step within each batch rather than based on the whole dataset - maybe using collate_fn()?
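Something like the sketch below is what I have in mind (untested; pad_collate is my own placeholder name, and I'm assuming each dataset item comes back as a (skill_seq, label) pair of 1-D tensors):

import torch
from torch.nn.utils.rnn import pad_sequence

# hypothetical collate_fn: pads each batch only to the longest sequence in that batch
def pad_collate(batch):
    seqs, labels = zip(*batch)
    seqs = pad_sequence(seqs, batch_first=True, padding_value=0)
    labels = pad_sequence(labels, batch_first=True, padding_value=0)
    return seqs, labels

loader = torch.utils.data.DataLoader(dataset, batch_size=32, collate_fn=pad_collate)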
That might give you some ideas on how to feed data in this case.
At any rate, you could just define your __getitem__ to add padding to the data and labels. Pass max_seq_len and pad_token as arguments to __init__, and then just do:
import torch

data = torch.tensor(skill_seq)
# number of pad entries needed to reach the fixed length
data_pad_len = max_seq_len - data.size(0)
data = torch.cat([data, torch.full((data_pad_len,), pad_token, dtype=data.dtype)])
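Put together, a minimal sketch of such a Dataset might look like this (SkillSeqDataset and its constructor arguments are placeholder names; I'm assuming sequences and labels are lists of integer lists):

import torch
from torch.utils.data import Dataset

class SkillSeqDataset(Dataset):
    def __init__(self, sequences, labels, max_seq_len, pad_token):
        self.sequences = sequences
        self.labels = labels
        self.max_seq_len = max_seq_len
        self.pad_token = pad_token

    def __len__(self):
        return len(self.sequences)

    def _pad(self, seq):
        # right-pad a single sequence to the dataset-wide fixed length
        t = torch.tensor(seq)
        pad_len = self.max_seq_len - t.size(0)
        return torch.cat([t, torch.full((pad_len,), self.pad_token, dtype=t.dtype)])

    def __getitem__(self, idx):
        return self._pad(self.sequences[idx]), self._pad(self.labels[idx])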
Thanks, I did look at that example. I'm also going by the Deep Learning with PyTorch book, so the example confused me from the standpoint of why the Dataset -> DataLoader approach was abandoned in favor of batchify().
Right, but if you decide you need the padding completed per item, you'll likely need to rewrite the custom Dataset __getitem__ function, as mentioned in my previous comment. The DataLoader takes that and runs it in each worker process, based on the num_workers you set.
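For example, assuming the padded Dataset sketched above:

from torch.utils.data import DataLoader

# each worker calls __getitem__ independently; since every item is already
# padded to max_seq_len, the default collate can stack them into a batch
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)

for data, labels in loader:
    print(data.shape)  # (32, max_seq_len)

Note the trade-off: with per-item padding every batch is padded to the full max_seq_len, whereas the collate_fn approach only pads each batch to its own longest sequence.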