I have a corpus of text in which each ‘paragraph’ is about 80 words long - each ‘paragraph’ is itself a training example and carries a label for classification. I am following the PyTorch transformer tutorial and am confused about the preprocessing in the example.
I am using - Language Modeling with nn.Transformer and torchtext — PyTorch Tutorials 2.0.1+cu117 documentation
I have done tokenization before, so that part is trivial - however, I am confused about how the corpus (with each paragraph being a training example) is converted:
def data_process(raw_text_iter: dataset.IterableDataset) -> Tensor:
    """Converts raw text into a flat Tensor."""
    data = [torch.tensor(vocab(tokenizer(item)), dtype=torch.long)
            for item in raw_text_iter]
    return torch.cat(tuple(filter(lambda t: t.numel() > 0, data)))
I am confused as to why a flat tensor is created - maybe someone can shed light on this. The example uses the WikiText2 data, which when previewed still appears to be split into paragraphs. The data_process function, however, creates one flat tensor - for my data, roughly 80 words × 500,000 paragraphs long - so it is simply one very long tensor of numbers, as if it were a single text, which is not the case. It feels like both the context information and the training-example boundaries are completely lost! It is also important to note that word 80 and word 81 are NOT necessarily found in the same context and must be treated as separate training samples. Can someone shed light on what is going on here? And if I have misunderstood, how does this flat tensor still retain information about the individual texts?
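To make my concern concrete, here is a minimal sketch with two toy ‘paragraphs’ (already numericalised - the tensors here are made-up stand-ins for the output of vocab(tokenizer(...))). The torch.cat call is what data_process does; the pad_sequence call is what I would have expected for a classification setup, where each row stays one sample:

```python
import torch

# Two separate toy "paragraphs", already converted to token ids
# (hypothetical values standing in for vocab(tokenizer(item))).
para_a = torch.tensor([5, 9, 2], dtype=torch.long)
para_b = torch.tensor([7, 7, 1], dtype=torch.long)

# What data_process does: concatenate everything into one flat tensor.
flat = torch.cat((para_a, para_b))
print(flat)  # tensor([5, 9, 2, 7, 7, 1]) - the boundary after index 2 is gone

# What I would expect for classification: keep examples separate,
# e.g. pad to a common length and stack, so each row is one sample.
batch = torch.nn.utils.rnn.pad_sequence([para_a, para_b], batch_first=True)
print(batch.shape)  # torch.Size([2, 3])
```

As far as I can tell, nothing in the flat tensor marks where one paragraph ends and the next begins, which is exactly what worries me.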