Padding text data from torchtext.datasets for a recurrent network

I am working with datasets from torchtext.datasets and need to do classification using a recurrent network.

import torch
from torch.utils.data import DataLoader
from torchtext.datasets import text_classification

NGRAMS = 1
BATCH_SIZE = 20
train_dataset, test_dataset = text_classification.DATASETS['AG_NEWS'](
    root='./.data', ngrams=NGRAMS, vocab=None)

and the DataLoader

data = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=batching)

and the collate (batching) function

def batching(batch):
    # each entry is a (label, text) pair, where text is a 1-D LongTensor of token ids
    label = torch.tensor([entry[0] for entry in batch])
    text = [entry[1] for entry in batch]
    text_lengths = [entry.size(0) for entry in text]
    max_text_length = max(text_lengths)
    # left-pad every sequence with zeros up to the length of the longest one in the batch
    padded = [torch.cat((torch.zeros(max_text_length - text_lengths[i], dtype=torch.long), text[i]), 0)
              for i in range(len(text))]
    text2 = torch.stack(padded)  # the padded tensor, shape (batch_size, max_text_length)

    return text2, label

The function above ensures that all the sentences in the batch have equal length by padding them with 0 (on the left).

  1. Am I doing the right thing?
  2. Do I need to add 0 to the vocabulary? If so, how do I add it, given that the vocabulary is created automatically?
  3. Do I need to add start- and end-of-sentence tokens? How do I add them to the vocabulary?
  4. Lastly, do I need an offset?
  1. I personally don’t work with torchtext but do all the preprocessing myself; it’s not complicated and I can tweak it to my liking. I assume there are a bunch of tutorials out there.

  2. Strictly speaking, I don’t think that 0 (representing padding <PAD>) has to be in the vocabulary (you probably also need an index to represent unknown <UNK>) – although it would probably be the cleaner solution. However:

    • you have to make sure the 0 does not already reference a real word in your existing vocabulary
    • your word embedding layer has to be large enough to cover all words in your vocabulary plus the padding index (and <UNK>, <SOS>, <EOS>, if needed) – see the sketch after this list
  3. Since you only do classification, I don’t think you need <SOS> and/or <EOS>. Those are particularly important for the decoder in sequence-to-sequence models.

  4. No idea what you mean by offset 🙂
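
To make the second point concrete, here is a minimal sketch of an embedding layer that reserves index 0 for <PAD> and index 1 for <UNK>; the vocabulary size and embedding dimension are assumed values, just for illustration:

import torch
import torch.nn as nn

vocab_size = 10000      # number of real words in the vocabulary (assumed)
num_special = 2         # reserved indexes: 0 = <PAD>, 1 = <UNK>

# the embedding table must cover the real words plus the reserved indexes;
# padding_idx=0 keeps the <PAD> vector at zero and excludes it from gradient updates
embedding = nn.Embedding(vocab_size + num_special, 100, padding_idx=0)

padded_batch = torch.zeros(20, 50, dtype=torch.long)  # (batch_size, max_text_length)
embedded = embedding(padded_batch)                    # shape (20, 50, 100)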

If it’s useful, here’s the code for the Vocabulary class I use. Depending on what extra indexes I need, I can create an initial one like:

vocabulary = Vocabulary(default_indexes={0: '<pad>', 1: '<unk>', 2: '<sos>', 3: '<eos>'})

or

vocabulary = Vocabulary(default_indexes={0: '<pad>', 1: '<unk>'})

if I just have a classifier and don’t need <sos> and <eos>. The first word I then add to the vocabulary has index 4 or 2, respectively, and so on. Is this what you mean by offset?
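
A minimal sketch of such a Vocabulary class could look like the following (the attribute and method names here are assumptions, not necessarily the original implementation):

class Vocabulary:
    def __init__(self, default_indexes=None):
        # reserved indexes, e.g. {0: '<pad>', 1: '<unk>'}
        self.index2word = dict(default_indexes or {})
        self.word2index = {w: i for i, w in self.index2word.items()}

    def add_word(self, word):
        # new words get the next free index after the reserved ones
        if word not in self.word2index:
            index = len(self.index2word)
            self.index2word[index] = word
            self.word2index[word] = index
        return self.word2index[word]

    def __getitem__(self, word):
        # fall back to <unk> for out-of-vocabulary words
        return self.word2index.get(word, self.word2index.get('<unk>'))

    def __len__(self):
        return len(self.index2word)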

Anyway, I would assume torchtext can handle this smoothly as well. You might want to look for a sequence-to-sequence tutorial that uses torchtext, since <sos> and <eos> are needed there.

Thanks @vdw for the nice answer. I think offsets are required by EmbeddingBag, where sequences are concatenated into one long sequence and offsets mark where each individual sequence starts. Since you choose to pad sequences instead, offsets are not really necessary here.
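
As an illustration (with made-up token ids), this is roughly how EmbeddingBag consumes a concatenated batch plus offsets instead of a padded one:

import torch
import torch.nn as nn

embedding = nn.EmbeddingBag(num_embeddings=10, embedding_dim=4, mode='mean')

# two sequences of token ids (lengths 3 and 2), concatenated into one long tensor
text = torch.tensor([1, 2, 4, 5, 3], dtype=torch.long)
# offsets mark where each individual sequence starts in the concatenated tensor
offsets = torch.tensor([0, 3], dtype=torch.long)

pooled = embedding(text, offsets)  # shape (2, 4): one pooled vector per sequence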

If your vocab comes with a <PAD> token, you could get the pad id by

pad_id = train_dataset.get_vocab()['<pad>']
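
That pad id could then be used in the collate function instead of a hard-coded 0, for example with torch.nn.utils.rnn.pad_sequence (a sketch; note that pad_sequence pads on the right, unlike the left padding above, so pass the lengths along if you want to use pack_padded_sequence later):

import torch
from torch.nn.utils.rnn import pad_sequence

def batching_with_pad_id(batch):
    label = torch.tensor([entry[0] for entry in batch])
    text = [entry[1] for entry in batch]
    lengths = torch.tensor([t.size(0) for t in text])
    # right-pad with the vocab's pad id instead of a hard-coded 0
    padded = pad_sequence(text, batch_first=True, padding_value=pad_id)
    return padded, lengths, label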

Sometimes, you may even want to batch sequences of similar lengths together to reduce the amount of padding. There is an issue on the torchtext repository that explains padding (link).
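
One simple way to get batches of similar lengths (a sketch of the general idea, not necessarily the approach from the linked issue) is to sort the examples by text length before building the DataLoader:

# sort examples by text length so each batch mostly contains sequences of similar
# length, which keeps the per-batch padding small; for real training you would
# usually shuffle within length buckets instead of sorting globally
sorted_train = sorted(
    (train_dataset[i] for i in range(len(train_dataset))),
    key=lambda entry: entry[1].size(0))

data = DataLoader(sorted_train, batch_size=BATCH_SIZE, shuffle=False, collate_fn=batching)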