I’m following the “Text classification with the torchtext library” tutorial.
import torch
from torch.utils.data import DataLoader
from torchtext.datasets import AG_NEWS

# text_pipeline, label_pipeline, and device are defined earlier in the tutorial.
def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        offsets.append(processed_text.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)  # start index of each sample
    text_list = torch.cat(text_list)  # all samples concatenated into one 1-D tensor
    return label_list.to(device), text_list.to(device), offsets.to(device)

train_iter = AG_NEWS(split='train')
dataloader = DataLoader(train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch)
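To see concretely what this collate function builds, here is a minimal standalone sketch with made-up token ids (the values are purely illustrative, not from AG_NEWS):

import torch

# Three "sentences" of made-up token ids, with lengths 3, 2, and 4.
batch_token_ids = [
    torch.tensor([3, 14, 15]),
    torch.tensor([9, 26]),
    torch.tensor([5, 35, 89, 79]),
]
lengths = [0] + [t.size(0) for t in batch_token_ids]
offsets = torch.tensor(lengths[:-1]).cumsum(dim=0)  # same cumsum trick as collate_batch
flat = torch.cat(batch_token_ids)                   # one 1-D tensor, length 9
print(flat)     # tensor([ 3, 14, 15,  9, 26,  5, 35, 89, 79])
print(offsets)  # tensor([0, 3, 5]) -- start index of each sentence in `flat`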
After building the dataloader as above, I call
print(next(iter(dataloader))[1])
to get the text values for the first batch. However, I found that it is actually a single tensor that merges all 8 sentences (the batch size is 8) as if they were one: its length is 338, which is the total number of tokens across the 8 samples in the batch. Is this normal behavior?
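If it matters, my understanding is that the tutorial’s model consumes exactly this layout via nn.EmbeddingBag, which takes a flat 1-D tensor of token ids plus per-sample offsets. A minimal sketch of that (num_embeddings and embedding_dim here are arbitrary values I picked, not the tutorial’s):

import torch
import torch.nn as nn

embedding = nn.EmbeddingBag(num_embeddings=100, embedding_dim=16)
flat = torch.tensor([3, 14, 15, 9, 26, 5, 35, 89, 79])  # 3 samples, flattened
offsets = torch.tensor([0, 3, 5])                        # start of each sample
out = embedding(flat, offsets)
print(out.shape)  # torch.Size([3, 16]) -- one pooled embedding per sample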