Torchtext dataset: don't load the whole file into memory

Hi guys, I’m super new to the torchtext library and currently learning.
I’ve read some tutorials on torchtext.data, but I’m still not sure what I should do for some use cases.

For example, a common NLP task is training a word2vec model, and I’m not sure what the correct and most efficient way is to prepare/load the training data for this task.

I tried the most naive approach, i.e. writing an ad hoc Python script to pre-process the corpus into a tsv format, for example like this:

anarchism	originated	1
anarchism	as	1
anarchism	a	1
anarchism	term	1
anarchism	of	1
anarchism	race	0
anarchism	one	0
anarchism	from	0
anarchism	details	0
anarchism	hereditary	0

The 1st column is the target word, the 2nd is the context word, and the 3rd is the label (0 means a negative sample).
Then I tried to load this preprocessed corpus via torchtext.data.TabularDataset.
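Roughly something like this (a simplified sketch of my loading code using the legacy torchtext.data API; the field setup and file path are just illustrative):

    from torchtext.data import Field, LabelField, TabularDataset

    # Each row holds a single token per column, so sequential=False.
    TARGET = Field(sequential=False)
    CONTEXT = Field(sequential=False)
    LABEL = LabelField()  # 1 = observed pair, 0 = negative sample

    # TabularDataset eagerly builds an Example object for every row,
    # so the entire tsv ends up in memory here.
    dataset = TabularDataset(
        path="corpus.tsv",  # placeholder path
        format="tsv",
        fields=[("target", TARGET), ("context", CONTEXT), ("label", LABEL)],
    )
    TARGET.build_vocab(dataset)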

But this doesn’t work: it takes far too much memory and time, and never finishes.
Please provide some comments/answers: what is the right way to do this?

Two things:

  1. If you plan to use torchtext.data.TabularDataset, your data should fit in memory. Only then can torchtext read it and create batches.
  2. If that is not possible, you can use lazy loading to create your custom batches (using yield, as in Python generators; see the sketch below).
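For example, a minimal sketch of option 2 (the file path, batch size, and tab-separated layout are assumptions based on your example above):

    def iter_examples(path):
        """Yield (target, context, label) tuples one line at a time."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                target, context, label = line.rstrip("\n").split("\t")
                yield target, context, int(label)

    def iter_batches(path, batch_size=1024):
        """Group the streamed examples into fixed-size batches."""
        batch = []
        for example in iter_examples(path):
            batch.append(example)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:  # flush the last partial batch
            yield batch

Only one batch is ever held in memory at a time, so the corpus size no longer matters.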

Thanks @Abhilash_Srivastava!
I think I’m in the second situation at the moment, but as a follow-up question: if I use lazy loading, I’m afraid some torchtext functionality, such as Field.build_vocab, can no longer be used, right?
If so, there’s no need for me to use torchtext in the first place.

You can have a look at the “datasets” library, which is highly optimized.

That’s right. With the second option, torchtext is out of the picture. Hence, you’ll need to create your own vocab, along with the stoi and itos dictionaries.
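Something along these lines (a rough sketch that streams the file once; the tsv layout matches your example above):

    def build_vocab(path):
        """Collect the vocabulary and build stoi/itos mappings."""
        vocab = set()
        with open(path, encoding="utf-8") as f:
            for line in f:
                target, context, _ = line.rstrip("\n").split("\t")
                vocab.update((target, context))
        itos = sorted(vocab)                       # index -> string
        stoi = {w: i for i, w in enumerate(itos)}  # string -> index
        return stoi, itos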

Sorry, which one do you mean? Could you please drop a link?
I thought you were talking about torchvision.datasets, but that doesn’t feel right here.

This one. It uses the Apache Arrow format, which allows incredible speed with a low memory footprint.
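Assuming this refers to the Hugging Face datasets library, loading a tsv like yours looks roughly like this (the column names are assumptions):

    from datasets import load_dataset

    # The data is cached as Arrow files on disk and memory-mapped,
    # so the whole corpus never has to fit in RAM.
    dataset = load_dataset(
        "csv",
        data_files="corpus.tsv",  # placeholder path
        delimiter="\t",
        column_names=["target", "context", "label"],
    )["train"]

    for example in dataset:
        print(example["target"], example["context"], example["label"])
        break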


Thanks, I’ll surely try it!