I’m following the “Text classification with the torchtext library” tutorial.
import torch
from torch.utils.data import DataLoader
from torchtext.datasets import AG_NEWS

# text_pipeline, label_pipeline, and device are defined earlier in the tutorial.
def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        offsets.append(processed_text.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)  # start index of each sample
    text_list = torch.cat(text_list)  # all samples concatenated into one 1-D tensor
    return label_list.to(device), text_list.to(device), offsets.to(device)

train_iter = AG_NEWS(split='train')
dataloader = DataLoader(train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch)
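To see concretely what this collate function builds, here is a minimal standalone sketch with made-up token ids (the values are purely illustrative, not from AG_NEWS):

import torch

# Three "sentences" of made-up token ids, with lengths 3, 2, and 4.
batch_token_ids = [
    torch.tensor([3, 14, 15]),
    torch.tensor([9, 26]),
    torch.tensor([5, 35, 89, 79]),
]
lengths = [0] + [t.size(0) for t in batch_token_ids]
offsets = torch.tensor(lengths[:-1]).cumsum(dim=0)  # same cumsum trick as collate_batch
flat = torch.cat(batch_token_ids)                   # one 1-D tensor, length 9
print(flat)     # tensor([ 3, 14, 15,  9, 26,  5, 35, 89, 79])
print(offsets)  # tensor([0, 3, 5]) -- start index of each sentence in `flat`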
After building the dataloader as above, I call
print(next(iter(dataloader))[1])
to get the text values for the first batch. However, I found that it is actually a single tensor that merges all 8 sentences (the batch size is 8) as if they were one: its length is 338, which is the total number of tokens across the 8 samples in the batch. Is this normal behavior?
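If it matters, my understanding is that the tutorial’s model consumes exactly this layout via nn.EmbeddingBag, which takes a flat 1-D tensor of token ids plus per-sample offsets. A minimal sketch of that (num_embeddings and embedding_dim here are arbitrary values I picked, not the tutorial’s):

import torch
import torch.nn as nn

embedding = nn.EmbeddingBag(num_embeddings=100, embedding_dim=16)
flat = torch.tensor([3, 14, 15, 9, 26, 5, 35, 89, 79])  # 3 samples, flattened
offsets = torch.tensor([0, 3, 5])                        # start of each sample
out = embedding(flat, offsets)
print(out.shape)  # torch.Size([3, 16]) -- one pooled embedding per sample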