In my courses, I share prepared Jupyter notebooks with my students. All notebooks that handle textual data currently still use torchtext (e.g., this one and this one). Now that torchtext is no longer maintained or developed, I would like to “refresh” the notebooks to remove any use of torchtext.
What would be the recommended best practice? I basically used torchtext only for building the vocabulary, and then transforming tokens/words to their respective indices, and vice versa.
Before using torchtext, I actually used my own implementation for that, so I know how to do it. But I would prefer to stick to popular libraries/tools/etc. and best practices to streamline my notebooks and keep the code clean.
I realize this is an old post, but seeing no responses: I have also been looking for a replacement, with no luck. My current workflow is to use spaCy, which handles most of the plumbing tasks (e.g., tokenization), and then handle the conversion to tensors on my own in PyTorch Lightning DataModules.
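As a rough sketch of the "convert to tensors on my own" step: the helper below pads a batch of index lists to a common length. The name `pad_batch` and the choice of 0 as the padding index are assumptions for illustration, and the code is kept in plain Python so it runs without PyTorch installed.

```python
def pad_batch(index_lists, pad_index=0):
    """Pad variable-length index lists to the length of the longest one.

    Hypothetical helper; pad_index=0 assumes the padding token has index 0.
    Returns the padded lists and the original sequence lengths.
    """
    lengths = [len(seq) for seq in index_lists]
    max_len = max(lengths, default=0)
    padded = [list(seq) + [pad_index] * (max_len - len(seq)) for seq in index_lists]
    return padded, lengths


batch = [[3, 1, 2], [5], [4, 6]]
padded, lengths = pad_batch(batch)
print(padded)   # [[3, 1, 2], [5, 0, 0], [4, 6, 0]]
print(lengths)  # [3, 1, 2]
```

In a DataModule, something like this would typically live in the collate function, with `torch.tensor(padded)` as the last step.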
I only ever used torchtext to create vocabularies and then convert lists of tokens to lists of indices. Any preprocessing such as tokenization (well, basic word-level tokenization), lemmatization, etc., I always do with spaCy as well.
I therefore simply do it myself now with a custom class, which is simple, short, and easy to extend if needed. Here is a notebook explaining the individual steps (it’s an educational notebook), and the Vocabulary class is in here.
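For anyone reading along without opening the links, a minimal sketch of what such a class can look like is below. This is not the class from the linked notebook: the method names (`encode`, `decode`), the `min_freq` parameter, and the `<pad>`/`<unk>` special tokens are assumptions chosen for illustration.

```python
from collections import Counter


class Vocabulary:
    """Hypothetical minimal vocabulary: maps tokens to indices and back."""

    PAD, UNK = "<pad>", "<unk>"  # assumed special tokens

    def __init__(self, token_lists, min_freq=1):
        counts = Counter(tok for toks in token_lists for tok in toks)
        # Special tokens come first, so <pad> is index 0 and <unk> is index 1.
        self.itos = [self.PAD, self.UNK]
        # Descending frequency, ties broken alphabetically, for a stable order.
        for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
            if freq >= min_freq and tok not in (self.PAD, self.UNK):
                self.itos.append(tok)
        self.stoi = {tok: idx for idx, tok in enumerate(self.itos)}

    def __len__(self):
        return len(self.itos)

    def encode(self, tokens):
        """Tokens -> indices; unknown tokens map to the <unk> index."""
        unk = self.stoi[self.UNK]
        return [self.stoi.get(tok, unk) for tok in tokens]

    def decode(self, indices):
        """Indices -> tokens."""
        return [self.itos[idx] for idx in indices]


corpus = [["the", "cat", "sat"], ["the", "dog", "sat", "down"]]
vocab = Vocabulary(corpus)
ids = vocab.encode(["the", "bird", "sat"])
print(ids)                # [3, 1, 2]
print(vocab.decode(ids))  # ['the', '<unk>', 'sat']
```

The token lists can come from any tokenizer (spaCy in my case); the class itself only sees lists of strings, which keeps it independent of the preprocessing.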