Torchtext dataset: don't load the whole file into memory

Hi everyone, I'm new to the torchtext library and currently learning it.
I've read some tutorials, but I'm still not sure how to handle some use cases.

For example, a common task in NLP is training a word2vec model, and I'm not sure what the correct and most efficient way is to prepare/load the training data for this task.

I tried the most naive approach, i.e. I wrote an ad-hoc Python script to pre-process the corpus into a TSV format, for example like this:

anarchism	originated	1
anarchism	as	1
anarchism	a	1
anarchism	term	1
anarchism	of	1
anarchism	race	0
anarchism	one	0
anarchism	from	0
anarchism	details	0
anarchism	hereditary	0

The 1st column is the target word, the 2nd is the context word, and the 3rd is the label (0 means a negative sample).
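A simplified sketch of such a preprocessing step (the window size, vocabulary, and negative-sampling scheme here are illustrative, not my exact script):

```python
import random

def skipgram_pairs(tokens, window=2, num_neg=5, vocab=None, rng=None):
    """Yield (target, context, label) rows: label 1 for true context
    words within the window, 0 for randomly drawn negative samples."""
    rng = rng or random.Random(0)
    vocab = vocab or sorted(set(tokens))
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield target, tokens[j], 1
        for _ in range(num_neg):
            yield target, rng.choice(vocab), 0

tokens = "anarchism originated as a term of abuse".split()
with open("pairs.tsv", "w") as f:
    for target, context, label in skipgram_pairs(tokens):
        f.write(f"{target}\t{context}\t{label}\n")
```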
Then I tried to load this preprocessed corpus with torchtext.

But this didn't work: it took too much memory and time, and never finished.
Could anyone comment on what the right way to do this is?

Two things:

  1. If you plan to use torchtext, your data should fit in memory. Only then can torchtext read it and create batches of data.
  2. If that's not possible, you can use lazy loading to create your own custom batches (use `yield`, as in Python generators).
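As a sketch of the lazy-loading idea in option 2 (the file name and batch size are placeholders):

```python
def lazy_batches(path, batch_size=1024):
    """Stream (target, context, label) batches from a TSV file
    without ever holding the whole file in memory."""
    batch = []
    with open(path) as f:
        for line in f:
            target, context, label = line.rstrip("\n").split("\t")
            batch.append((target, context, int(label)))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # final, possibly smaller batch
        yield batch
```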

Thanks @Abhilash_Srivastava!
I think I'm in the second situation at the moment, but the follow-up question is: if I use lazy loading, I'm afraid some torchtext functionality, such as `Field.build_vocab`, can no longer be used, right?
If so, there's no need for me to use torchtext in the first place.

You can have a look at the “datasets” library, which is highly optimized.

That's right. With the second option, torchtext is out of the picture, so you'll need to create your own vocab, along with `stoi` and `itos` dictionaries.
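Building the vocab by hand is straightforward, e.g. (a minimal sketch; the special tokens and frequency cutoff are up to you):

```python
from collections import Counter

def build_vocab(tokens, min_freq=1, specials=("<unk>", "<pad>")):
    """Return stoi/itos mappings, similar in spirit to what
    Field.build_vocab produces in (legacy) torchtext."""
    counts = Counter(tokens)
    itos = list(specials) + sorted(
        w for w, c in counts.items() if c >= min_freq and w not in specials
    )
    stoi = {w: i for i, w in enumerate(itos)}
    return stoi, itos

stoi, itos = build_vocab("the cat sat on the mat".split())
unk = stoi["<unk>"]
ids = [stoi.get(w, unk) for w in "the dog sat".split()]
```

Unknown words fall back to the `<unk>` index via `stoi.get(w, unk)`.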

Sorry, which one do you mean? Could you please drop a link?
I thought you were talking about torchvision.datasets, but that doesn't feel right here.

This one. It uses the Apache Arrow format, which allows incredible speed with a low memory footprint.


Thanks, I'll surely try it!