# create batch function
def batching(batch):
    # each entry in the batch is a (label, text) pair, text being a 1D LongTensor of token ids
    label = torch.tensor([entry[0] for entry in batch])
    text = [entry[1] for entry in batch]
    text_length = [t.shape[0] for t in text]
    max_text_length = max(text_length)
    # left-pad every sequence with zeros up to the length of the longest one in the batch
    new_text = [torch.cat((torch.zeros(max_text_length - text_length[i], dtype=torch.long), text[i]), 0)
                for i in range(len(text))]
    text2 = torch.stack(new_text)  # the padded tensor, shape (batch_size, max_text_length)
    return text2, label

data = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True, collate_fn=batching)
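As a quick sanity check of the left-padding logic, here is a self-contained sketch on a toy batch (the data and the helper name are made up for illustration):

```python
import torch

def left_pad_batch(batch):
    # batch: list of (label, text) pairs, text being a 1D LongTensor of token ids
    labels = torch.tensor([lab for lab, _ in batch])
    texts = [txt for _, txt in batch]
    max_len = max(t.shape[0] for t in texts)
    # prepend zeros so every sequence reaches max_len, then stack into one tensor
    padded = torch.stack([
        torch.cat((torch.zeros(max_len - t.shape[0], dtype=torch.long), t))
        for t in texts
    ])
    return padded, labels

batch = [(1, torch.tensor([5, 6])), (0, torch.tensor([7, 8, 9]))]
texts, labels = left_pad_batch(batch)
print(texts.tolist())  # [[0, 5, 6], [7, 8, 9]]
```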
The function above is meant to ensure that all the sentences in the batch are of equal length by padding them with 0 (on the left).
Am I doing the right thing?
Do I need to add 0 to the vocabulary? If so, how, given that the vocabulary is created automatically?
Do I need to add start- and end-of-sentence tokens? How do I add them to the vocabulary?
I personally don’t work with torchtext but do all the preprocessing myself; it’s not complicated and I can tweak it to my liking. I assume there are a bunch of tutorials out there.
Strictly speaking, I don't think that 0 (representing the padding token <PAD>) has to be in the vocabulary (you probably also need an index to represent unknown words, <UNK>), although it would probably be the cleaner solution. However:
- you have to make sure that 0 does not already reference a real word in your existing vocabulary
- your word embedding layer has to be large enough to cover all words in your vocabulary plus the padding (and <UNK>, <SOS>, <EOS>, if needed)
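The second point can be made concrete with nn.Embedding: num_embeddings must cover the vocabulary plus the special tokens, and padding_idx tells the layer that index 0 is padding (the vocabulary size below is made up):

```python
import torch
import torch.nn as nn

num_words = 10000     # real words in the vocabulary (made-up size)
num_specials = 2      # <PAD> = 0, <UNK> = 1

emb = nn.Embedding(num_embeddings=num_words + num_specials,
                   embedding_dim=50,
                   padding_idx=0)  # row 0 is all zeros and receives no gradient

tokens = torch.tensor([[0, 0, 2, 3]])  # one left-padded sequence
out = emb(tokens)
print(out.shape)  # torch.Size([1, 4, 50])
```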
Since you do only classification, I don’t think you need <SOS> and/or <EOS>. Those are important particularly for the decoder in sequence-to-sequence models.
No idea what you mean by offset
If it's useful, here's the code of the Vocabulary class I use. Depending on what extra indexes I need, I can create an initial one accordingly: if I just have a classifier and don't need <sos> and <eos>, the first word I then add to the vocabulary has index 2; with <sos> and <eos> included, it has index 4; and so on. Do you mean this by offset?
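Since the class itself isn't shown here, a minimal sketch of such a Vocabulary (my own illustration, not the author's actual code) could look like this:

```python
class Vocabulary:
    """Maps words to integer ids, reserving low ids for special tokens."""

    def __init__(self, use_sos_eos=False):
        # special tokens come first: <pad> = 0, <unk> = 1, optionally <sos> = 2, <eos> = 3
        self.specials = ['<pad>', '<unk>']
        if use_sos_eos:
            self.specials += ['<sos>', '<eos>']
        self.word2idx = {w: i for i, w in enumerate(self.specials)}
        self.idx2word = list(self.specials)

    def add_word(self, word):
        # assign the next free index to unseen words
        if word not in self.word2idx:
            self.word2idx[word] = len(self.idx2word)
            self.idx2word.append(word)
        return self.word2idx[word]

    def __getitem__(self, word):
        # unknown words fall back to <unk>
        return self.word2idx.get(word, self.word2idx['<unk>'])

vocab = Vocabulary(use_sos_eos=True)
print(vocab.add_word('hello'))  # 4  (first real word after the four specials)
```

With use_sos_eos=False, the first real word gets index 2 instead, matching the "index 4 or 2" above.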
Anyway, I would assume torchtext can handle this smoothly as well. You might want to look for a sequence-to-sequence tutorial that uses torchtext, since <sos> and <eos> are needed there.
Thanks @vdw for the nice answer. I think offsets is required by EmbeddingBag, where the sequences are concatenated into one long sequence and offsets marks where each individual sequence begins. Since you choose to pad sequences, offsets are not really necessary here.
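To illustrate what offsets does for EmbeddingBag: the batch is passed as one flat tensor, and offsets gives the start position of each sequence (a minimal sketch with made-up token ids):

```python
import torch
import torch.nn as nn

bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=4, mode='mean')

# two sequences, [1, 2, 3] and [4, 5], concatenated into one flat tensor
text = torch.tensor([1, 2, 3, 4, 5])
offsets = torch.tensor([0, 3])  # sequence starts: position 0 and position 3

out = bag(text, offsets)  # one pooled vector per sequence
print(out.shape)  # torch.Size([2, 4])
```

No padding is needed at all here, which is why EmbeddingBag uses offsets instead.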
If your vocab comes with a <PAD> token, you can get the pad id by
    pad_id = train_dataset.get_vocab()['<pad>']
Sometimes you may even want to batch sequences of similar lengths together. There is an issue post on the torchtext repo that explains padding.
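One simple way to batch sequences of similar lengths (so less computation is wasted on padding) is to sort by length before chunking into batches; a sketch of the idea, with made-up helper names and toy data:

```python
import torch

def length_bucketed_batches(texts, batch_size):
    # sort indices by sequence length so each batch holds similar lengths
    order = sorted(range(len(texts)), key=lambda i: texts[i].shape[0])
    for start in range(0, len(order), batch_size):
        batch = [texts[i] for i in order[start:start + batch_size]]
        max_len = max(t.shape[0] for t in batch)
        # left-pad within the batch only up to that batch's longest sequence
        yield torch.stack([
            torch.cat((torch.zeros(max_len - t.shape[0], dtype=torch.long), t))
            for t in batch
        ])

texts = [torch.ones(n, dtype=torch.long) for n in (5, 2, 7, 3)]
shapes = [b.shape for b in length_bucketed_batches(texts, batch_size=2)]
print(shapes)  # [torch.Size([2, 3]), torch.Size([2, 7])]
```

Note that sorting trades some shuffling randomness for efficiency; in practice you would shuffle the order of the buckets themselves between epochs.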