Prepare data for Bi-LSTM text classifier

I want to build a binary text classifier with a Bi-LSTM.

Which is the best way to prepare data as input to the model? Especially, create the vocabulary of the dataset for using it with word embeddings.

Thanks in advance.

You should iterate over the entire dataset to create a unique list of tokens (i.e. the vocab) and assign a unique id to each of them. Then map your text input to these ids before passing it to the model. Also, the model should start with an embedding layer; look up nn.Embedding.from_pretrained for that purpose.
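A minimal sketch of those steps (the helper names, the whitespace tokenizer, and the reserved ids for padding/unknown tokens are my own choices, not something prescribed):

```python
from collections import Counter

def build_vocab(texts, min_freq=1):
    """Count tokens across the whole dataset and assign each a unique id.
    Ids 0 and 1 are reserved here for padding and unknown tokens."""
    counter = Counter(tok for text in texts for tok in text.split())
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, freq in counter.most_common():
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab

def encode(text, vocab):
    """Map a whitespace-tokenized text to a list of ids,
    falling back to <unk> for out-of-vocabulary tokens."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in text.split()]

texts = ["the cat sat", "the dog barked"]
vocab = build_vocab(texts)
ids = encode("the cat barked", vocab)
```

The resulting id sequences are what you feed to the embedding layer at the front of the model.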

@Sagnik_Mukherjee thank you for your response.

Actually, I was looking for a more technical response. While searching for libraries to tokenise the dataset, build the vocabulary, etc., I found the torchtext package, which is developed by PyTorch but, being a separate package, is versioned independently of PyTorch (0.8.0a0+c4a91f2 at the time of writing).

Is this package one of the recommended ways to prepare data for the model, or are there others that are preferable?

p.s. since I will use GloVe embeddings, your tip about nn.Embedding.from_pretrained was helpful.
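For what it's worth, here is a small sketch of how from_pretrained can be wired up. The tensor below is a toy stand-in for GloVe vectors (in practice you would load the real vectors, with one row per vocab id):

```python
import torch
import torch.nn as nn

# Toy stand-in for GloVe vectors: one 4-dim row per vocab entry.
# Row order must match your token ids (<pad>=0, <unk>=1, ... here).
pretrained = torch.tensor([
    [0.0, 0.0, 0.0, 0.0],   # <pad>
    [0.1, 0.1, 0.1, 0.1],   # <unk>
    [0.5, -0.2, 0.3, 0.9],  # e.g. "the"
])

# freeze=True keeps the pretrained weights fixed during training;
# pass freeze=False if you want to fine-tune them with the model.
embedding = nn.Embedding.from_pretrained(pretrained, freeze=True, padding_idx=0)

ids = torch.tensor([2, 1, 0])   # a short encoded sentence
vectors = embedding(ids)        # shape: (3, 4)
```

Whether to freeze the embeddings or fine-tune them is worth treating as a hyperparameter for a small classifier like this.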


I personally prefer spaCy for tokenization. In practice, though, you should try out different tokenizers and keep the one that gives the best results.
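A quick sketch of spaCy tokenization (using spacy.blank so no model download is needed; for full pipelines you would load e.g. en_core_web_sm instead):

```python
import spacy

# spacy.blank("en") gives a lightweight pipeline containing just the
# English tokenizer, without downloading a trained model.
nlp = spacy.blank("en")

def tokenize(text):
    return [tok.text for tok in nlp(text)]

tokens = tokenize("Don't panic, it's fine.")
```

Note that spaCy splits contractions and punctuation into separate tokens, which a plain whitespace split would not.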

I used torchtext for preparing data as input for the model.
It is very convenient for tokenization and building the vocabulary. For tokenization it has some built-in options and supports the spaCy tokenizer as well.
Even though torchtext also provides data loading, I used it only for data preparation and used PyTorch's DataLoader instead.

Thank you for your response,