Why is there a `torch.cat` in the collate function for a text dataloader?

I am looking at a PyTorch tutorial for a simple text classifier. The tutorial is simple enough, but what confuses me is the collate function used when creating batches of data for training. I am not clear on why the tutorial uses torch.cat to combine the text lists.

The code below creates a dataloader that yields the text label and then the numericalized sentence itself. My question is: why do the authors of the tutorial use torch.cat() on the text_list in the batch? It seems like concatenating them would blur the difference between the individual texts. Will the data go from being a batch of 8 texts to a batch of just 1 big text?

import torch
from torchtext.datasets import AG_NEWS
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = get_tokenizer('basic_english')

# vocab is built from the training split (this step appears earlier in the tutorial)
def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(AG_NEWS(split='train')), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(x) - 1


def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        offsets.append(processed_text.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)  # <------------- QUESTION: why use torch.cat here?
    return label_list.to(device), text_list.to(device), offsets.to(device)

train_iter = AG_NEWS(split='train')
dataloader = DataLoader(train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch)

I am asking because, when I look at some sample data using next(iter(dataloader)), I cannot even tell where one example ends and the next one begins. I forced the sequence length of my texts to be just 2 so that I can read the output more easily. There are 8 examples of length 2, but just by looking at the output there seems to be 1 vector with 16 elements instead of 8 vectors with 2 elements each. That could just be a limitation of how tensors are printed, but I wanted to make sure I was not doing anything weird.

next(iter(dataloader))

(tensor([ 1059,   454,   431,   425,    58,     8,    78,   798, 38487,   410,
          202,  1643,   272,  1197, 16858,    30]),
 tensor([3, 3, 3, 3, 3, 3, 3, 3]))

So, can anyone explain the use of torch.cat here? Is it necessary, or are there other ways to do this?

Your observation seems to be right and is also described in the linked tutorial:

In this example, the text entries in the original data batch input are packed into a list and concatenated as a single tensor for the input of nn.EmbeddingBag. The offset is a tensor of delimiters to represent the beginning index of the individual sequence in the text tensor. Label is a tensor saving the labels of individual text entries.
[…]
Although the text entries here have different lengths, nn.EmbeddingBag module requires no padding here since the text lengths are saved in offsets.

where the offsets are used in:

class TextClassificationModel(nn.Module):

    def __init__(self, vocab_size, embed_dim, num_class):
        super(TextClassificationModel, self).__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        # text is the concatenated batch; offsets marks where each sequence starts
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)
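
To make the role of the offsets more concrete, here is a small standalone sketch (with made-up token indices, not the AG_NEWS data): nn.EmbeddingBag with its default mode='mean' takes the concatenated text tensor plus the offsets and returns one vector per sequence, which matches embedding each sequence separately with a regular nn.Embedding and averaging:

import torch
import torch.nn as nn

# Two "sentences" of different lengths, already numericalized and concatenated,
# just like text_list after torch.cat in collate_batch
text = torch.tensor([4, 1, 7, 2, 9])     # sentence 0 -> [4, 1, 7], sentence 1 -> [2, 9]
offsets = torch.tensor([0, 3])           # start index of each sentence inside `text`

emb_bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=4)  # default mode='mean'
out = emb_bag(text, offsets)             # shape [2, 4]: one vector per sentence

# Same result computed manually with a regular nn.Embedding
emb = nn.Embedding(10, 4)
emb.weight = emb_bag.weight              # share the weights for the comparison
manual = torch.stack([emb(text[0:3]).mean(dim=0), emb(text[3:5]).mean(dim=0)])

print(torch.allclose(out, manual))       # True

So the concatenation does not mix the examples; the offsets keep track of where each one starts.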

@ptrblck Thanks for your comment and for validating my intuition here. I had not really seen this EmbeddingBag layer before, but I can see now that it combines the data batch into one long vector and then uses the offsets to identify the boundaries of each example. Haha, it is not a structure I was accustomed to, since the regular Embedding layer does not work that way.
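
(In case it helps anyone else reading this: the per-example tensors can be recovered from the concatenated tensor and the offsets, e.g. with something like torch.tensor_split.)

import torch

text_list = torch.tensor([10, 11, 12, 20, 21, 30])   # concatenated batch (made-up values)
offsets = torch.tensor([0, 3, 5])                    # the examples start at indices 0, 3 and 5

# cut at the start indices of examples 1, 2, ... to get the per-example tensors back
pieces = torch.tensor_split(text_list, offsets[1:])
# -> (tensor([10, 11, 12]), tensor([20, 21]), tensor([30]))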

Does this EmbeddingBag layer work better than a standard embedding layer? It seems like I can still use existing GloVe embeddings and other downloaded embeddings with EmbeddingBag too.

This blog post explains the difference between these layers. I don’t know which layer would work better for which use cases, so you should definitely try out both :wink:
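
Regarding pretrained vectors: yes, you can load a weight matrix such as GloVe into nn.EmbeddingBag the same way as into nn.Embedding, e.g. via from_pretrained. A quick sketch with a random matrix standing in for the pretrained weights:

import torch
import torch.nn as nn

# stand-in for a pretrained matrix such as GloVe: [vocab_size, embed_dim]
pretrained = torch.randn(100, 50)

emb_bag = nn.EmbeddingBag.from_pretrained(pretrained, freeze=False, mode='mean')

text = torch.tensor([5, 12, 3, 77])
offsets = torch.tensor([0, 2])       # two sequences: [5, 12] and [3, 77]
out = emb_bag(text, offsets)         # shape [2, 50]

Setting freeze=False lets the pretrained embeddings be fine-tuned during training.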
